Senior Site Reliability Engineer (SRE) - AI Infrastructure

$300k

Hamilton Barnes Associates Limited

Join a stealth-mode startup building out its AI and cloud platform, powered by thousands of H100s, H200s, and B200s and ready for experimentation, full-scale model training, or inference. As a Platform Engineer/Senior Site Reliability Engineer, you’ll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access. You’ll also support the company’s new products as they come to market.

This is a rare opportunity to work at the intersection of AI infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it. Get in touch and apply today!

Responsibilities:
Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have:
7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Salary & Benefits:
$300,000 gross per year
Equity

Vacancy posted 1 day ago

Similar jobs that could be interesting for you
Based on the Senior Site Reliability Engineer (SRE) - AI Infrastructure in San Francisco, CA vacancy

Site Reliability Engineer (SRE)
...Site Reliability Engineer (SRE) FLUIX is building the AI operating system that plans, designs, and optimizes AI infrastructure. We are based in Silicon Valley. We specialize in providing AI-driven solutions for data centers and power providers, leveraging cutting-edge...
Suggested
Work at office
Weekend work
Fluix AI
San Francisco, CA
3 days ago
Site Reliability Engineer (SRE)
$170k - $230k
...Site Reliability Engineer (SRE) Palo Alto / San Francisco Bay Area About Mithril Mithril is an AI infrastructure platform built to make GPU compute more accessible and affordable for the world's leading enterprises, AI startups, and the AI research community,...
Suggested
Work at office
Local area
1 day per week
Mithril
San Francisco, CA
3 days ago
Site Reliability Engineer (SRE)
$350k
...Site Reliability Engineer (SRE) San Francisco Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence... ...everyone has access to the knowledge and tools to make AI work for their unique needs and goals. We are scientists...
Suggested
Local area
Visa sponsorship
Work visa
Relocation package
Thinking Machines Lab
San Francisco, CA
4 days ago
Site Reliability Engineer (SRE)
...family-founded company on a mission to create the world's first AI-powered Personal & Entrepreneurial Resource Planner (PRP),... ...-and change lives along the way. The Role As a Site Reliability Engineer (SRE) at Air Apps, you will be responsible for ensuring the...
Suggested
Temporary work
Worldwide
Air Apps
San Francisco, CA
19 hours ago
Senior Site Reliability Engineer
...About the job Senior Site Reliability Engineer About the Company Stellar is a decentralized, public blockchain... ...cloud-based systems operations, as a SRE or DevOps engineer. ~ First-hand... ...code Experience experimenting with AI-driven approaches to operations...
Senior
TechChain Talent
San Francisco, CA
4 days ago
Senior Site Reliability Engineer
...databases to data warehouses, lakes, and AI applications. With tens of thousands of... ...Role You'll be the infrastructure and reliability engineer on the Data Replication team - a full-... ...in infrastructure, platform engineering, SRE, or DevOps. ~ Hands-on ownership of...
Senior
Local area
Airbyte
San Francisco, CA
4 days ago
Senior Site Reliability Engineer
...create the next generation of Gen AI-driven code reviewers: a... ...significantly outperforms individual engineers. We combine language models... ...are seeking an experienced Site Reliability Engineer to join our Platform... ...services reliably. As an SRE at CodeRabbit, you'll be...
Senior
CodeRabbit
San Francisco, CA
19 hours ago
Senior Site Reliability Engineer
$195k - $240k
...Senior Site Reliability Engineer San Francisco (Hybrid) At You.com, we are building the AI Search Infrastructure that powers modern AI systems. Our goal is to create the trusted... ...is measurable. Develop and maintain SRE standards and patterns (instrumentation guidelines...
Senior
Full time
Immediate start
Remote work
Work from home
Flexible hours
Y.O.U.
San Francisco, CA
3 days ago
Senior Site Reliability Engineer, Fleet Management
$127k - $249k
...The Team Platform Engineering is the department within SRE that is responsible for a range of critical infrastructure... ...components that ensure cluster reliability and security (e.g., CoreDNS, cert-... ...the data platform for the AI era, enabling builders to create, transform...
Senior
Work at office
Local area
Remote work
Worldwide
Flexible hours
MongoDB
San Francisco, CA
4 days ago
Senior Site Reliability Engineer
$166.9k - $225.9k
...Job Summary: Drata's SRE team operates as both a central engineering function and an embedded reliability practice. You'll be part of a... ...~6+ years of experience in Site Reliability Engineering, Cloud... ...Experience with AIOps - using AI/ML-based tooling for anomaly...
Senior
Work at office
Immediate start
Worldwide
Monday to Friday
Flexible hours
Drata Inc
San Francisco, CA
19 hours ago
Senior Staff Site Reliability Engineer
$220k - $235k
...Staff/Senior Staff Site Reliability Engineer Ironclad is the leading AI contracting platform that transforms agreements into assets. Contracts move faster, insights... ...seeking a strategic, high-output Staff/Senior Staff SRE to define the future of our cloud platform and...
Senior
Full time
Contract work
Work at office
Ironclad Inc
San Francisco, CA
19 hours ago
Remote Senior Site Reliability Engineer (SRE) - Zetachain
We are seeking a Sr. Site Reliability Engineer to join our team and run critical infrastructure for our blockchain and web applications. You’ll learn... ...tools to streamline development processes. DevOps Engineer/SRE Transitioning to Blockchain An experienced DevOps Engineer...
Senior
Remote job
Blockchain Works
San Francisco, CA
11 days ago
Senior Site Reliability Engineer (Upmarket)
...deeply human. Heidi is building an AI Care Partner that works... ...possible. We’re a team of doctors, engineers, designers, researchers, and... ...-to-end. Improve operational reliability: Identify recurring issues... ...re looking for 3-6+ years in SRE, DevOps, Platform, or operations...
Senior
Work at office
Worldwide
Heidi Health Ltd
San Francisco, CA
4 days ago
Senior Site Reliability Engineer
...acquisition, and Connor was a machine learning research engineer at Scale AI. The rest of our team comes from companies like... ...go-to-market with state-of-the-art AI. As a Senior SRE, you'll tackle the scaling and reliability challenges that come with adding terabytes of...
Senior
Unify
San Francisco, CA
19 hours ago
Senior Staff Site Reliability Engineer
$181k - $263k
...privacy requirements. The Global SRE team is responsible for owning and supporting... ...support. We are looking for a Senior Staff Site Reliability Engineer who will set the technical direction... ...organization Familiarity with LLMs and AI-assisted development workflows,...
Senior
Work from home
Flexible hours
Night shift
LiveRamp
San Francisco, CA
19 hours ago
Senior Site Reliability Engineer
...information, please read our... Senior Site Reliability Engineer locations:... ...Function Site Reliability Engineering (SRE) is a discipline that incorporates... ...in Python supported by Gen AI tooling to accelerate development of...
Senior
Immediate start
Remote work
Worldwide
OutSystems Inc.
San Francisco, CA
4 days ago
Senior Site Reliability Engineer
$166.9k - $225.9k
Job Summary Drata's SRE team operates as both a central engineering function and an embedded reliability practice. You'll be part of a close-knit... ...6+ years of experience in Site Reliability Engineering, Cloud... ...Experience with AIOps—using AI/ML‑based tooling for anomaly...
Senior
Flexible hours
Drata
San Francisco, CA
19 hours ago
CloudDevs: Senior Site Reliability Engineer (SRE)
CloudDevs: Senior Site Reliability Engineer (SRE) CloudDevs works with fast-moving, venture-backed startups throughout the US. We’re building a pool of world-class Site Reliability Engineers for current roles and for upcoming opportunities. You’ll either be placed...
Senior
The10minutecareersolution
San Francisco, CA
1 day ago
Senior SRE & Platform Engineer for AI-Driven Ops
$163k - $203k
GoTo Meeting is looking for a Senior Site Reliability Engineer in San Francisco. You will be responsible for the reliability, scalability, and security... ...candidate will mentor junior engineers and implement AI-driven operations. Benefits include a hybrid work model, competitive...
Senior
GoTo Meeting
San Francisco, CA
4 days ago
Senior SRE Platform Engineer for AI-Powered Code Review
An innovative R&D company in San Francisco is seeking a Site Reliability Engineer to join its Platform Engineering team. This position focuses on ensuring the reliability and performance of an AI-powered code review platform. The ideal candidate will have 6-8 years of experience...
Senior
CodeRabbit
San Francisco, CA
2 days ago
Senior Site Reliability Engineer AI Infrastructure
Senior Site Reliability Engineer - AI Infrastructure Location: Global Remote / San Francisco • Full-Time About Andromeda Andromeda Cluster was founded by... ...and engineering. The Role This is not a generalist SRE role. You will design, operate, and debug large‑scale GPU...
Senior
Full time
Remote work
Cortes 23
San Francisco, CA
4 days ago
Senior Manager, Site Reliability Engineering - Infrastructure Platform
$232k - $319k
...Secure Every Identity, from AI to Human Identity is the key to... ...service with great people and reliable, cost-effective, and efficient... ...and various initiatives across SRE & Infrastructure organization.... ...velocity of SRE and product engineering by developing robust platforms...
Senior
Permanent employment
Local area
Worldwide
Flexible hours
Okta, Inc.
San Francisco, CA
4 days ago
Sr. Director, SRE Platform Engineering
$202.8k - $327.63k
...management (CLM). What you’ll do The Senior Director, SRE Platform Engineering is a senior engineering leader... ...IT Service Management (ITSM) and Site Reliability Engineering (SRE) capabilities, applying... ...lead teams that deliver secure, AI‑driven, and intuitive experiences...
Senior
Permanent employment
Contract work
Work at office
Local area
Remote work
2 days per week
DocuSign, Inc.
San Francisco, CA
3 days ago
Senior AI/ML Infra & SRE Engineer
Senior Infrastructure Engineer - Bland As a Senior Infrastructure Engineer at Bland,... ...processing with strict latency and reliability requirements; building and... ...industries. Lead - AI/ML Stack Infrastructure Lead... ...global deployments. Work with Site Reliability Engineering to...
Senior
Temporary work
AI Chopping Block, Inc.
San Francisco, CA
1 day ago
Senior Site Reliability Engineer
...We Are Hyperbolic Labs is on a mission to democratize AI by breaking down the barriers to computing power with our Open... ...redefine computing. About the Role We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace and AI infrastructure...
Senior
Hyperbolic Labs
San Francisco, CA
3 days ago
Senior Site Reliability Engineer
$160k - $250k
...machine learning models, we also need to grow our DevOps and Site Reliability team to maintain the reliability of our enterprise SaaS offering... ...individuals who are passionate about creating a revolutionary AI company. At Hive, you will have a steep learning curve and an...
Senior
Hive
San Francisco, CA
3 days ago
Senior Platform & Reliability Engineer (SRE)
$200k - $250k
...platform that combines modern web tooling with AI-powered workflows. Our stack includes React/... ...based production infrastructure. We’re hiring a senior owner of stability and infrastructure to ensure the platform is reliable, fast, and resilient as we scale. Role Mission...
Senior
Permanent employment
Vizcom
San Francisco, CA
4 days ago
Senior Software Engineer - Site Reliability Engineering
...Udaip Cloud-Based Data And Ai Platform Engineer At U.S. Bank, we're on a journey to do our best. Helping the customers and businesses we serve to make better and smarter financial decisions and enabling the communities we support to grow and succeed. We believe it...
Senior
Temporary work
Work experience placement
Phenom People
San Francisco, CA
3 days ago
Senior Site Reliability Engineer
$181.69k - $213.75k
...Senior Site Reliability Engineer San Francisco, California; Santa Clara, California; Seattle, WA The Company You'll Join Carta connects founders... ...of RESTful and/or GraphQL API design principles. AI Fluency: You use AI tools in your own day-to-day work in...
Senior
Full time
Work at office
Carta
San Francisco, CA
3 days ago
Senior SRE Engineer: Scale & Reliability (Kubernetes/GCP)
A leading language learning platform is seeking an experienced SRE Engineer to ensure the reliability and resilience of their infrastructure. Responsibilities include leading incident response, improving observability, and collaborating with various teams to enhance platform...
Senior
Speak
San Francisco, CA
2 days ago
