Staff + Sr. Software Engineer, AI Reliability

$325k

Menlo Ventures

About Anthropic

Anthropic’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

About the Role

AIRE (AI Reliability Engineering) partners with teams across Anthropic to improve reliability across our most critical serving paths: every hop from the SDK through our network, API layers, serving infrastructure, and accelerators, and back. We jump into the trenches alongside partner teams to make the systems that deliver Claude more robust and resilient, whether during an incident or while collaborating on projects.

Reliability here is an emergent phenomenon that transcends any single team's boundaries, so someone has to zoom out and look at the whole picture. That's us, and it means few teams at Anthropic offer this kind of dynamic, cross-cutting exposure to the systems that matter most.

Claude has your back. AIRE has Claude's. Help us keep Claude reliable for everyone who depends on it.

Responsibilities

  • Develop appropriate Service Level Objectives (SLOs) for large language model serving systems, balancing availability and latency with development velocity.
  • Design and implement monitoring and observability systems across the token path.
  • Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers.
  • Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.
  • Support the reliability of safeguard model serving, which is critical for both site reliability and Anthropic's safety commitments.

You may be a good fit if you:

  • Have a strong background in distributed systems, infrastructure, or reliability; we're looking for reliability-minded software engineers and SREs.
  • Are curious and brave: comfortable jumping into unfamiliar systems during an incident and helping drive resolution even when you don't have deep expertise yet.
  • Think holistically about how systems compose and where the seams are.
  • Can build lasting relationships across teams; our engagement model depends on being welcomed as teammates, not outsiders with opinions.
  • Care about users and feel ownership over outcomes, even for systems you don't own.
  • Have excellent communication and collaboration skills; you'll be partnering across the entire company.
  • Bring diverse experience; the team's strength comes from people who've built product stacks, scaled databases, run massive distributed systems, and everything in between.

Strong candidates may also:

  • Have been an SRE, Production Engineer, or in a similar reliability-focused role on large-scale systems.
  • Have experience operating large-scale model serving or training infrastructure (>1,000 GPUs).
  • Have experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium).
  • Understand ML-specific networking optimizations like RDMA and InfiniBand.
  • Have expertise in AI-specific observability tools and frameworks.
  • Have experience with chaos engineering and systematic resilience testing.
  • Have contributed to open-source infrastructure or ML tooling.

Annual Salary: $325,000 – $485,000 USD

Logistics

  • Education requirements: We require at least a Bachelor's degree in a related field or equivalent experience.
  • Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.
  • Visa sponsorship: We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.
We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every qualification as listed. Research shows that people who identify as being from underrepresented groups are more prone to experiencing imposter syndrome and doubting the strength of their candidacy, so we urge you not to exclude yourself prematurely and to submit an application if you're interested in this work.

We think AI systems like the ones we're building have enormous social and ethical implications. We think this makes representation even more important, and we strive to include a range of diverse perspectives on our team.

Your safety matters to us: To protect yourself from potential scams, remember that Anthropic recruiters only contact you from @anthropic.com email addresses. In some cases, we may partner with vetted recruiting agencies who will identify themselves as working on behalf of Anthropic. Be cautious of emails from other domains. Legitimate Anthropic recruiters will never ask for money, fees, or banking information before your first day. If you're ever unsure about a communication, don't click any links; visit directly for confirmed position openings.

Vacancy posted 3 hours ago