Site Reliability Engineer - AI Infrastructure

$250k

Hamilton Barnes Associates Limited

Are you looking for an exciting new opportunity? Join a seed-stage AI infrastructure company building large-scale training and inference platforms previously accessible only to hyperscalers. The business began with a single managed GPU cluster that reached capacity almost immediately and has since expanded into a global platform spanning infrastructure, networking, and orchestration.

Responsibilities
  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have
  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits
  • IPO equity
  • 10% company bonus
  • 401K with 4% match
  • Salary: $250,000 gross per year

Vacancy posted 3 days ago
Similar jobs that could be interesting for you
Based on the Site Reliability Engineer - AI Infrastructure in San Francisco, CA vacancy
  • Site Reliability Engineer - AI Infrastructure Location: Global Remote / San Francisco · Full-Time About Andromeda Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only... 
    Suggested
    Full time
    Remote work

    Andromeda Cluster

    San Francisco, CA
    2 days ago
  • Senior Site Reliability Engineer - AI Infrastructure Location: Global Remote / San Francisco • Full-Time About Andromeda Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early‑stage startups access to the kind of scaled AI infrastructure once reserved... 
    Suggested
    Full time
    Remote work

    Cortes 23

    San Francisco, CA
    5 days ago
  • A tech company focused on AI is seeking a Site Reliability Engineer to ensure the reliability and performance of its GPU marketplace. This role involves maintaining service level objectives, managing capacity, and implementing secure systems. The ideal candidate has strong... 
    Suggested

    Hyperbolic Labs

    San Francisco, CA
    2 days ago
  •  ...building the category-defining AI workflow automation platform that...  ...’re hiring an SRE to join our engineering team at Plenful and take ownership of the reliability and performance of the systems...  ...fully implemented and measured. Infrastructure and Platform Operations:... 
    Suggested
    Work at office
    Remote work
    Flexible hours
    2 days per week

    Plenful

    San Francisco, CA
    3 days ago
  •  ...security, delivering an AI-powered platform that...  .... As a Staff Platform Engineer, you will play a...  ...leadership role. You will own reliability for major platform...  ...maintaining the shared infrastructure services and platforms...  ...Platform Engineering, or Site Reliability... 
    Suggested

    Saviynt

    San Francisco, CA
    2 days ago
  • $125k - $165k

    A leading innovator in laboratory software is seeking a Site Reliability Engineer in San Francisco, CA. The role focuses on ensuring reliability and performance of AI systems, managing production infrastructure, and operating resilient systems in cloud environments. The... 

    TELCOR

    San Francisco, CA
    3 days ago
  • $232k - $319k

     ...Secure Every Identity, from AI to Human Identity is the key...  ...the trusted, neutral infrastructure that enables organizations to...  ...service with great people and reliable, cost-effective, and efficient...  ...velocity of SRE and product engineering by developing robust platforms... 
    Permanent employment
    Local area
    Worldwide
    Flexible hours

    Okta, Inc.

    San Francisco, CA
    9 hours ago
  • $227.2k - $324.5k

     ...About the Role: Site Reliability Engineering (SRE) at Tubi is not a traditional operations team....  ...Partner with infra lead to align Tubi's infrastructure & SRE roadmap. Partner with tech...  ...for our observability and SRE related AI platforms, work with infra lead and finance... 
    Full time
    Contract work
    Temporary work
    Local area
    Flexible hours

    Tubi

    San Francisco, CA
    4 days ago
  • $163k - $203k

     ...SRE team, responsible for the reliability, scalability, and security...  ...This is as much of a platform engineering role as it is SRE role — you...  ....We are building an agentic AI-first operations model where...  ...compute (managed by the Infrastructure Engineering team) across all... 
    Work experience placement
    Work at office
    Local area
    Remote work
    Flexible hours
    2 days per week

    Prosper

    San Francisco, CA
    24 days ago
  • $166.9k - $225.9k

     ...operates as both a central engineering function and an embedded reliability practice. You'll be part...  ...reliability. Our infrastructure runs on AWS across multiple...  ...years of experience in Site Reliability Engineering,...  ...Experience with AIOps—using AI/ML‑based tooling for... 
    Flexible hours

    Drata

    San Francisco, CA
    1 day ago
  • Hyperbolic Labs is on a mission to democratize AI by breaking down the barriers to computing...  ...redefine computing. About the Role We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace and AI infrastructure operate with exceptional reliability,... 

    deCircle

    San Francisco, CA
    4 days ago
  • About HappyRobot HappyRobot is the infrastructure for enterprises to build and orchestrate AI workforces. Our AI workers don'...  ...looking for an Infrastructure Engineer to take the lead on scaling our...  ...role where you’ll shape how reliability is done - reducing incident load... 
    Worldwide
    Shift work

    Happyrobot Inc.

    San Francisco, CA
    5 days ago
  •  ...Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands...  ...day is currently Tuesday. Engineering at Lambda is responsible...  ...and improve product reliability. Lead members of other engineering...  ...5+ years of experience in Site Reliability Engineering... 
    Work at office
    Local area
    Work from home

    Lambda

    San Francisco, CA
    4 days ago
  • TELCOR Inc is looking for a Site Reliability Engineer to ensure the reliability, scalability, and performance of our AI products' systems. The role involves designing and operating...  ...environments while managing production infrastructure and deployment workflows. The ideal... 
    Remote job

    TELCOR Inc

    San Francisco, CA
    3 days ago
  •  ...the next generation of Gen AI‑driven code reviewers: a symbiotic...  ...outperforms individual engineers. We combine language models...  ...are seeking an experienced Site Reliability Engineer to join our Platform...  ...automation platforms, and owning the infrastructure that powers our AI‑driven... 

    CodeRabbit

    San Francisco, CA
    5 days ago
  • OutSystems, Inc. is looking for a Site Reliability Engineer to join their team in San Francisco, CA. The ideal candidate will lead the onboarding of services and teams to reliability tenets while establishing SLOs and SLAs. Proficiency in Python and experience with Kubernetes... 
    Flexible hours

    OutSystems, Inc.

    San Francisco, CA
    5 days ago
  • Happyrobot Inc. is looking for an Infrastructure Engineer in San Francisco, California. This role involves leading the stability and observability...  ...familiarity with monitoring tools. Join us at a high-growth AI startup backed by top investors, where you will have... 

    Happyrobot Inc.

    San Francisco, CA
    5 days ago
  • $163k - $203k

     ...SRE team, responsible for the reliability, scalability, and security...  ...This is as much a platform engineering role as it is an SRE role— you...  ...We are building an agentic AI‑first operations model where...  ...based compute (managed by the Infrastructure Engineering team) across all... 
    Work experience placement
    Work at office
    Remote work
    Flexible hours
    2 days per week

    GoTo Meeting

    San Francisco, CA
    5 days ago
  • $125k - $165k

    Position Site Reliability Engineer Location Lincoln, NE, San Francisco, CA, or Remote Job ID 434...  ...performance of the systems that power our AI products. This role will also design...  ...environments, and manage production infrastructure and deployment workflows across... 
    Temporary work
    Remote work
    Visa sponsorship
    Work visa
    Flexible hours

    TELCOR Inc

    San Francisco, CA
    3 days ago
  •  ...Connor was a machine learning research engineer at Scale AI. The rest of our team comes from...  ...Senior SRE, you'll tackle the scaling and reliability challenges that come with adding...  ...scale. What You'll Do Scale our data infrastructure: Optimize and extend our ClickHouse and... 

    Unify

    San Francisco, CA
    5 days ago
  • $151.5k - $252.5k

    Veeam is the Data and AI Trust Company, specializing in helping organizations...  ...are looking for an experienced Senior Site Reliability Engineer to join the Veeam Data Cloud (VDC)...  ...stack based on containers, serverless infrastructure, Golang, public cloud services in the... 
    Base plus commission
    Local area
    Worldwide

    Veeam

    San Francisco, CA
    1 day ago
  •  ...services and teams to the reliability tenets. Establish and maintain...  ..., reliable, and secure infrastructure, ensuring cloud‑native...  ...Program in Python, using Gen AI tooling to accelerate automation...  ...6+ years of experience in Site Reliability Engineering, managing infrastructure... 

    OutSystems, Inc.

    San Francisco, CA
    5 days ago
  • $60 per hour

    Senior Site Reliability Engineer (Copy) Seattle Hybrid (Hybrid location). Full-time. About Us Supio is a trusted AI platform purpose-built for law firms, reshaping how data drives impactful...  ...hotfixes — while also automating infrastructure, monitoring systems, and GitHub... 
    Full time
    Work at office
    Flexible hours

    Bonfirevc

    San Francisco, CA
    5 days ago
  • Senior Site Reliability Engineer, Hybrid - San Francisco. Our Mission &...  ...operates as both a central engineering function and an embedded reliability...  ...approach reliability. Our infrastructure runs on AWS across multiple...  ...with AIOps - using AI/ML-based tooling for anomaly... 
    Work at office
    Immediate start
    Worldwide
    Monday to Friday
    Flexible hours

    Careers at Drata

    San Francisco, CA
    1 day ago
  •  ...more information, please read our...  ...Senior Site...  ...software engineering and applies them to infrastructure and operations problems. The main goals...  ...Programming in Python supported by Gen AI tooling to accelerate development of... 
    Immediate start
    Remote work
    Worldwide

    OutSystems Inc.

    San Francisco, CA
    5 days ago
  • $127k - $249k

    The Team Platform Engineering is the department within SRE that is...  ...responsible for a range of critical infrastructure and operational functions...  ...that ensure cluster reliability and security (e.g., CoreDNS,...  ...redefined the database for the AI era, enabling innovators to... 
    Work at office
    Local area
    Remote work
    Worldwide
    Flexible hours

    MongoDB

    San Francisco, CA
    2 days ago
  • The role We're looking for a world-class Site Reliability Engineer to ensure the reliability, performance, and scalability of our AI infrastructure platform. You’ll be building and operating the core systems that power agentic AI at scale. Your mission: keep our ultra... 

    Blaxel

    San Francisco, CA
    1 day ago
  •  ...changing that, using AI to disrupt a massive market...  ...the role Gamma's infrastructure needs to be rock-solid...  ...users while enabling our engineering teams to ship fast....  ...tooling that improves reliability and partnering with engineering...  ...ll bring 5+ years in Site Reliability... 
    Work at office
    Work from home

    gamma.app

    San Francisco, CA
    3 days ago
  •  ...human. Heidi is building an AI Care Partner that works alongside...  .... We’re a team of doctors, engineers, designers, researchers, and...  ...-end. Improve operational reliability: Identify recurring issues...  ...Kubernetes clusters, cloud infrastructure, and core platform services,... 
    Work at office
    Worldwide

    Heidi Health Ltd

    San Francisco, CA
    5 days ago
  • $125k - $165k

    Position: Site Reliability Engineer Location: San Francisco, CA Job Id: 434 # of Openings: 1 TELCOR Inc, a leading...  ...Reliability Engineer to join our TELCOR AI Systems team! Do you have strong experience in cloud infrastructure, distributed systems and production... 
    Temporary work
    Work at office
    Visa sponsorship
    Work visa
    Relocation package
    Flexible hours

    TELCOR

    San Francisco, CA
    3 days ago
