Site Reliability Engineer - xAI Technical Operations
$180k - $400k
xAI
About xAI
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers and researchers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.
About the Role
xAI is building at a furious pace with the latest hardware to help people understand the universe, and we need Site Reliability Engineers (SREs) with at least 8 years of experience in distributed, internet-scale environments, including on-prem and cloud-based infrastructure.
You will own the availability and reliability of xAI's infrastructure and core services, including issue detection, problem management, incident management, and root cause analyses (RCAs). You will also own the operations processes behind that infrastructure, applying concepts like failure domains, blast radii, and canary testing. You will be expected to participate in a team on-call rotation and to help usher xAI into the next generation of infrastructure management across multiple data centers and cloud environments.
Responsibilities Will Include
- Setting technical strategy and roadmap for infrastructure availability.
- Automating monitoring, alerting, and troubleshooting for high-availability services, while working with legacy systems to scale, improve, or deprecate.
- Owning incident response and problem management, and conducting thorough RCAs to prevent recurrence and drive continuous improvement.
- Analyzing performance metrics and service health to identify, resolve, and mitigate bottlenecks or failures in distributed environments.
- Ensuring security, scalability, and resilience of production infrastructure supporting AI workloads.
Location
Work is in-office, based in either Palo Alto, California or Dublin, Ireland.
Required Qualifications
- A minimum of 8 years of software, systems or reliability engineering experience.
- Experience managing services in distributed, internet-scale *nix environments, including on-prem and cloud (e.g., AWS, GCP).
- Development experience in Python, Scala, Java, C, or C++.
- Demonstrable knowledge of TCP/IP, networking, and systems programming (e.g., Bash and shell tools).
- Familiarity with containerization and orchestration tools (e.g., Kubernetes, Docker, Mesos) and systems management (e.g., Puppet, Chef, Ansible).
- Bachelor's degree in Computer Science, Electrical Engineering, or a related field (or equivalent experience).
Preferred Experiences
- Experience in on-call rotations and incident response in high-stakes environments.
- Experience with AI/ML infrastructure and large-scale GPU clusters.
- Strong problem-solving skills and ability to thrive in a fast-paced, ambiguous setting.
- Comfortable with deployment, support, monitoring, administration, and troubleshooting across on-prem, cloud and hybrid infrastructures.
- Proven understanding of systems and application design, including operational trade-offs.
Interview Process
After submitting your application, the team reviews your CV and statement of exceptional work. If your application passes this stage, you will be invited to an initial interview (45 minutes - 1 hour) during which a member of our team will ask some basic questions. If you clear the initial phone interview, you will enter the main process, which consists of four interviews:
- Coding assessment in a language of your choice.
- Site reliability and operations technologies.
- Manager interview.
- Meet and greet with the team, including a presentation of a large-scale solution or problem you owned from start to finish.
Our goal is to finish the main process within one week. We don’t rely on recruiters for assessments. Every application is reviewed by a member of our technical team. All interviews will be conducted via Google Meet.
Annual Salary Range
$180,000 - $400,000 USD
Benefits
Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.
xAI is an equal opportunity employer.

