Senior Site Reliability Engineer (SRE) - AI Infrastructure

$300k

Hamilton Barnes Associates Limited

Join a stealth-mode startup building out its AI and cloud platform, powered by thousands of H100s, H200s, and B200s, ready for experimentation, full-scale model training, or inference. As a Platform Engineer/Senior Site Reliability Engineer, you’ll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access. You’ll also support the company's new products as they come to market.

This is a rare opportunity to work at the intersection of AI infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it. Get in touch and apply today!

Responsibilities:
  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have:
  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Salary & Benefits:
  • $300,000 gross per year
  • Equity

Vacancy posted 1 day ago
Similar jobs that could be interesting for you
Based on the Senior Site Reliability Engineer (SRE) - AI Infrastructure in San Francisco, CA vacancy
  •  ...Site Reliability Engineer (SRE) FLUIX is building the AI operating system that plans, designs, and optimizes AI infrastructure. We are based in Silicon Valley. We specialize in providing AI-driven solutions for data centers and power providers, leveraging cutting-edge... 
    Suggested
    Work at office
    Weekend work

    Fluix AI

    San Francisco, CA
    3 days ago
  • $170k - $230k

     ...Site Reliability Engineer (SRE) Palo Alto / San Francisco Bay Area About Mithril Mithril is an AI infrastructure platform built to make GPU compute more accessible and affordable for the world's leading enterprises, AI startups, and the AI research community,... 
    Suggested
    Work at office
    Local area
    1 day per week

    Mithril

    San Francisco, CA
    3 days ago
  • $350k

     ...Site Reliability Engineer (SRE) San Francisco Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence...  ...everyone has access to the knowledge and tools to make AI work for their unique needs and goals. We are scientists... 
    Suggested
    Local area
    Visa sponsorship
    Work visa
    Relocation package

    Thinking Machines Lab

    San Francisco, CA
    4 days ago
  •  ...family-founded company on a mission to create the world's first AI-powered Personal & Entrepreneurial Resource Planner (PRP),...  ...-and change lives along the way. The Role As a Site Reliability Engineer (SRE) at Air Apps, you will be responsible for ensuring the... 
    Suggested
    Temporary work
    Worldwide

    Air Apps

    San Francisco, CA
    19 hours ago
  •  ...About the job Senior Site Reliability Engineer About the Company Stellar is a decentralized, public blockchain...  ...cloud-based systems operations, as a SRE or DevOps engineer. ~ First-hand...  ...code Experience experimenting with AI-driven approaches to operations... 
    Senior

    TechChain Talent

    San Francisco, CA
    4 days ago
  •  ...databases to data warehouses, lakes, and AI applications. With tens of thousands of...  ...Role You'll be the infrastructure and reliability engineer on the Data Replication team - a full-...  ...in infrastructure, platform engineering, SRE, or DevOps. ~ Hands-on ownership of... 
    Senior
    Local area

    Airbyte

    San Francisco, CA
    4 days ago
  •  ...create the next generation of Gen AI-driven code reviewers: a...  ...significantly outperforms individual engineers. We combine language models...  ...are seeking an experienced Site Reliability Engineer to join our Platform...  ...services reliably. As an SRE at CodeRabbit, you'll be... 
    Senior

    CodeRabbit

    San Francisco, CA
    19 hours ago
  • $195k - $240k

     ...Senior Site Reliability Engineer San Francisco (Hybrid) At You.com, we are building the AI Search Infrastructure that powers modern AI systems. Our goal is to create the trusted...  ...is measurable. Develop and maintain SRE standards and patterns (instrumentation guidelines... 
    Senior
    Full time
    Immediate start
    Remote work
    Work from home
    Flexible hours

    Y.O.U.

    San Francisco, CA
    3 days ago
  • $127k - $249k

     ...The Team Platform Engineering is the department within SRE that is responsible for a range of critical infrastructure...  ...components that ensure cluster reliability and security (e.g., CoreDNS, cert-...  ...the data platform for the AI era, enabling builders to create, transform... 
    Senior
    Work at office
    Local area
    Remote work
    Worldwide
    Flexible hours

    MongoDB

    San Francisco, CA
    4 days ago
  • $166.9k - $225.9k

     ...Job Summary: Drata's SRE team operates as both a central engineering function and an embedded reliability practice. You'll be part of a...  ...~6+ years of experience in Site Reliability Engineering, Cloud...  ...Experience with AIOps - using AI/ML-based tooling for anomaly... 
    Senior
    Work at office
    Immediate start
    Worldwide
    Monday to Friday
    Flexible hours

    Drata Inc

    San Francisco, CA
    19 hours ago
  • $220k - $235k

     ...Staff/Senior Staff Site Reliability Engineer Ironclad is the leading AI contracting platform that transforms agreements into assets. Contracts move faster, insights...  ...seeking a strategic, high-output Staff/Senior Staff SRE to define the future of our cloud platform and... 
    Senior
    Full time
    Contract work
    Work at office

    Ironclad Inc

    San Francisco, CA
    19 hours ago
  • We are seeking a Sr. Site Reliability Engineer to join our team and run critical infrastructure for our blockchain and web applications. You’ll learn...  ...tools to streamline development processes. DevOps Engineer/SRE Transitioning to Blockchain An experienced DevOps Engineer... 
    Senior
    Remote job

    Blockchain Works

    San Francisco, CA
    11 days ago
  •  ...deeply human. Heidi is building an AI Care Partner that works...  ...possible. We’re a team of doctors, engineers, designers, researchers, and...  ...-to-end. Improve operational reliability: Identify recurring issues...  ...re looking for 3-6+ years in SRE, DevOps, Platform, or operations... 
    Senior
    Work at office
    Worldwide

    Heidi Health Ltd

    San Francisco, CA
    4 days ago
  •  ...acquisition, and Connor was a machine learning research engineer at Scale AI. The rest of our team comes from companies like...  ...go-to-market with state-of-the-art AI. As a Senior SRE, you'll tackle the scaling and reliability challenges that come with adding terabytes of... 
    Senior

    Unify

    San Francisco, CA
    19 hours ago
  • $181k - $263k

     ...privacy requirements. The Global SRE team is responsible for owning and supporting...  ...support. We are looking for a Senior Staff Site Reliability Engineer who will set the technical direction...  ...organization Familiarity with LLMs and AI-assisted development workflows,... 
    Senior
    Work from home
    Flexible hours
    Night shift

    LiveRamp

    San Francisco, CA
    19 hours ago
  •  ...information, please read our...  ...Senior Site Reliability Engineer...  ...Function Site Reliability Engineering (SRE) is a discipline that incorporates...  ...in Python supported by Gen AI tooling to accelerate development of... 
    Senior
    Immediate start
    Remote work
    Worldwide

    OutSystems Inc.

    San Francisco, CA
    4 days ago
  • $166.9k - $225.9k

    Job Summary Drata's SRE team operates as both a central engineering function and an embedded reliability practice. You'll be part of a close-knit...  ...6+ years of experience in Site Reliability Engineering, Cloud...  ...Experience with AIOps—using AI/ML‑based tooling for anomaly... 
    Senior
    Flexible hours

    Drata

    San Francisco, CA
    19 hours ago
  • CloudDevs: Senior Site Reliability Engineer (SRE) CloudDevs works with fast-moving, venture-backed startups throughout the US. We’re building a pool of world-class Site Reliability Engineers for current roles and for upcoming opportunities. You’ll either be placed... 
    Senior

    The10minutecareersolution

    San Francisco, CA
    1 day ago
  • $163k - $203k

    GoTo Meeting is looking for a Senior Site Reliability Engineer in San Francisco. You will be responsible for the reliability, scalability, and security...  ...candidate will mentor junior engineers and implement AI-driven operations. Benefits include a hybrid work model, competitive... 
    Senior

    GoTo Meeting

    San Francisco, CA
    4 days ago
  • An innovative R&D company in San Francisco is seeking a Site Reliability Engineer to join its Platform Engineering team. This position focuses on ensuring the reliability and performance of an AI-powered code review platform. The ideal candidate will have 6-8 years of experience... 
    Senior

    CodeRabbit

    San Francisco, CA
    2 days ago
  • Senior Site Reliability Engineer - AI Infrastructure Location: Global Remote / San Francisco • Full-Time About Andromeda Andromeda Cluster was founded by...  ...and engineering. The Role This is not a generalist SRE role. You will design, operate, and debug large‑scale GPU... 
    Senior
    Full time
    Remote work

    Cortes 23

    San Francisco, CA
    4 days ago
  • $232k - $319k

     ...Secure Every Identity, from AI to Human Identity is the key to...  ...service with great people and reliable, cost-effective, and efficient...  ...and various initiatives across SRE & Infrastructure organization....  ...velocity of SRE and product engineering by developing robust platforms... 
    Senior
    Permanent employment
    Local area
    Worldwide
    Flexible hours

    Okta, Inc.

    San Francisco, CA
    4 days ago
  • $202.8k - $327.63k

     ...management (CLM). What you’ll do The Senior Director, SRE Platform Engineering is a senior engineering leader...  ...IT Service Management (ITSM) and Site Reliability Engineering (SRE) capabilities, applying...  ...lead teams that deliver secure, AI‑driven, and intuitive experiences... 
    Senior
    Permanent employment
    Contract work
    Work at office
    Local area
    Remote work
    2 days per week

    DocuSign, Inc.

    San Francisco, CA
    3 days ago
  • Senior Infrastructure Engineer - Bland As a Senior Infrastructure Engineer at Bland,...  ...processing with strict latency and reliability requirements; building and...  ...industries. Lead - AI/ML Stack Infrastructure Lead...  ...global deployments. Work with Site Reliability Engineering to... 
    Senior
    Temporary work

    AI Chopping Block, Inc.

    San Francisco, CA
    1 day ago
  •  ...We Are Hyperbolic Labs is on a mission to democratize AI by breaking down the barriers to computing power with our Open...  ...redefine computing. About the Role We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace and AI infrastructure... 
    Senior

    Hyperbolic Labs

    San Francisco, CA
    3 days ago
  • $160k - $250k

     ...machine learning models, we also need to grow our DevOps and Site Reliability team to maintain the reliability of our enterprise SaaS offering...  ...individuals who are passionate about creating a revolutionary AI company. At Hive, you will have a steep learning curve and an... 
    Senior

    Hive

    San Francisco, CA
    3 days ago
  • $200k - $250k

     ...platform that combines modern web tooling with AI-powered workflows. Our stack includes React/...  ...based production infrastructure. We’re hiring a senior owner of stability and infrastructure to ensure the platform is reliable, fast, and resilient as we scale. Role Mission... 
    Senior
    Permanent employment

    Vizcom

    San Francisco, CA
    4 days ago
  •  ...Udaip Cloud-Based Data And Ai Platform Engineer At U.S. Bank, we're on a journey to do our best. Helping the customers and businesses we serve to make better and smarter financial decisions and enabling the communities we support to grow and succeed. We believe it... 
    Senior
    Temporary work
    Work experience placement

    Phenom People

    San Francisco, CA
    3 days ago
  • $181.69k - $213.75k

     ...Senior Site Reliability Engineer San Francisco, California; Santa Clara, California; Seattle, WA The Company You'll Join Carta connects founders...  ...of RESTful and/or GraphQL API design principles. AI Fluency: You use AI tools in your own day-to-day work in... 
    Senior
    Full time
    Work at office

    Carta

    San Francisco, CA
    3 days ago
  • A leading language learning platform is seeking an experienced SRE Engineer to ensure the reliability and resilience of their infrastructure. Responsibilities include leading incident response, improving observability, and collaborating with various teams to enhance platform... 
    Senior

    Speak

    San Francisco, CA
    2 days ago
