Staff + Sr. Software Engineer, AI Reliability
$325k United States Digital Space LLC
About the company
The company’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

About the Role
AIRE (AI Reliability Engineering) partners with teams across the company to improve reliability across our most critical serving paths -- every hop from the SDK through our network, API layers, serving infrastructure, and accelerators and back. We jump into the trenches alongside partner teams to make the systems that deliver Claude more robust and resilient, whether during an incident or while collaborating on projects. Reliability here is an emergent phenomenon that transcends any single team's boundaries, so someone has to zoom out and look at the whole picture. That's us -- and it means few teams at the company offer this kind of dynamic, cross-cutting exposure to the systems that matter most.

Responsibilities:
- Develop appropriate Service Level Objectives for large language model serving systems, balancing availability and latency with development velocity
- Design and implement monitoring and observability systems across the token path
- Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers
- Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements
- Support the reliability of safeguard model serving -- critical for both site reliability and the company's safety commitments
You may be a good fit if you:
- Have a strong distributed systems, infrastructure, or reliability background -- we're looking for reliability-minded software engineers and SREs
- Are curious and brave -- comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don't have deep expertise yet
- Think holistically about how systems compose and where the seams are
- Can build lasting relationships across teams -- our engagement model depends on being welcomed as teammates, not outsiders with opinions
- Care about users and feel ownership over outcomes, even for systems you don't own
- Have excellent communication and collaboration skills -- you'll be partnering across the entire company
- Bring diverse experience -- the team's strength comes from people who've built product stacks, scaled databases, run massive distributed systems, and everything in between

Strong candidates may also:
- Have been an SRE or Production Engineer, or held a similar reliability-focused role on large-scale systems
- Have experience operating large-scale model serving or training infrastructure (>1000 GPUs)
- Have experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium)
- Understand ML-specific networking optimizations like RDMA and InfiniBand
- Have expertise in AI-specific observability tools and frameworks
- Have experience with chaos engineering and systematic resilience testing
- Have contributed to open-source infrastructure or ML tooling
Annual Salary: $325,000 - $485,000 USD

Logistics
- Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience
- Required field of study: A field relevant to the role as demonstrated through coursework, training, or professional experience
- Minimum years of experience: Years of experience required will correlate with the internal job level requirements for the position
- Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.
- Visa sponsorship: We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.

We encourage you to apply even if you do not believe you meet every single qualification.


