Site Reliability Engineer - AI Infrastructure

$250k

Hamilton Barnes Associates Limited

Are you looking for an exciting new opportunity? Join a seed-stage AI infrastructure company building large-scale training and inference platforms previously accessible only to hyperscalers. The business began with a single managed GPU cluster that reached capacity almost immediately and has since expanded into a global platform spanning infrastructure, networking, and orchestration.

Responsibilities
  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have
  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits
  • IPO Equity
  • 10% company bonus
  • 401(k) with 4% match

Salary
  • $250,000 gross per year

Vacancy posted 2 days ago
