Site Reliability Engineer - xAI Technical Operations
$180k - $400k
xAI
About xAI
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers and researchers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.
About the Role
xAI is building at a furious pace with the latest hardware to help people understand the universe, and we need Site Reliability Engineers (SREs) with at least 8 years of experience in distributed, internet-scale environments, including on-prem and cloud-based infrastructure.
You will own the availability and reliability of xAI's infrastructure and core services, including issue detection, problem management, incident management, and root cause analyses (RCAs). You will also own xAI's operations processes, applying concepts such as failure domains, blast radii, and canary testing. You will be expected to participate in a team on-call rotation and to help usher xAI into the next generation of infrastructure management across multiple data centers and cloud environments.
Responsibilities Will Include
- Setting technical strategy and roadmap for infrastructure availability.
- Automating monitoring, alerting, and troubleshooting for high-availability services, while working with legacy systems to scale, improve, or deprecate.
- Owning incident response, problem management, and conducting thorough RCAs to prevent recurrence and drive continuous improvement.
- Analyzing performance metrics and service health to identify, resolve, and mitigate bottlenecks or failures in distributed environments.
- Ensuring security, scalability, and resilience of production infrastructure supporting AI workloads.
Location
This is an in-office role based in either Palo Alto, California, or Dublin, Ireland.
Required Qualifications
- A minimum of 8 years of software, systems or reliability engineering experience.
- Experience managing services in distributed, internet-scale *nix environments, including on-prem and cloud (e.g., AWS, GCP).
- Development experience in Python, Scala, Java, C, or C++.
- Demonstrable knowledge of TCP/IP, networking, and systems programming (e.g., bash and shell tools).
- Familiarity with containerization and orchestration tools (e.g., Kubernetes, Docker, Mesos) and systems management (e.g., Puppet, Chef, Ansible).
- Bachelor's degree in Computer Science, Electrical Engineering, or a related field (or equivalent experience).
Preferred Experiences
- Experience in on-call rotations and incident response in high-stakes environments.
- Experience with AI/ML infrastructure and large-scale GPU clusters.
- Strong problem-solving skills and ability to thrive in a fast-paced, ambiguous setting.
- Comfortable with deployment, support, monitoring, administration, and troubleshooting across on-prem, cloud and hybrid infrastructures.
- Proven understanding of systems and application design, including operational trade-offs.
Interview Process
After submitting your application, the team reviews your CV and statement of exceptional work. If your application passes this stage, you will be invited to an initial interview (45 minutes - 1 hour) during which a member of our team will ask some basic questions. If you clear the initial phone interview, you will enter the main process, which consists of four interviews:
- Coding assessment in a language of your choice.
- Site reliability and operations technologies.
- Manager Interview.
- Meet and greet with the team, including a presentation of a large-scale solution or problem you owned from start to finish.
Our goal is to finish the main process within one week. We don’t rely on recruiters for assessments. Every application is reviewed by a member of our technical team. All interviews will be conducted via Google Meet.
Annual Salary Range
$180,000 - $400,000 USD
Benefits
Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.
xAI is an equal opportunity employer.
