Site Reliability Engineer - AI Infrastructure

$250k

Hamilton Barnes Associates Limited

Are you looking for an exciting new opportunity? Join a seed-stage AI infrastructure company building large-scale training and inference platforms previously accessible only to hyperscalers. The business began with a single managed GPU cluster that reached capacity almost immediately and has since expanded into a global platform spanning infrastructure, networking, and orchestration.

Responsibilities
Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have
7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits
IPO equity
10% company bonus
401(k) with 4% match
Salary: $250,000 gross per year

Vacancy posted 3 days ago

Similar jobs that could be interesting for you
Based on the Site Reliability Engineer - AI Infrastructure in San Francisco, CA vacancy

Site Reliability Engineer - AI Infrastructure
Site Reliability Engineer - AI Infrastructure Location: Global Remote / San Francisco · Full-Time About Andromeda Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only...
Suggested
Full time
Remote work
Andromeda Cluster
San Francisco, CA
2 days ago
Senior Site Reliability Engineer AI Infrastructure
Senior Site Reliability Engineer - AI Infrastructure Location: Global Remote / San Francisco • Full-Time About Andromeda Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early‑stage startups access to the kind of scaled AI infrastructure once reserved...
Suggested
Full time
Remote work
Cortes 23
San Francisco, CA
5 days ago
Senior Site Reliability Engineer - AI Cloud & GPU Infra
A tech company focused on AI is seeking a Site Reliability Engineer to ensure the reliability and performance of its GPU marketplace. This role involves maintaining service level objectives, managing capacity, and implementing secure systems. The ideal candidate has strong...
Suggested
Hyperbolic Labs
San Francisco, CA
2 days ago
Site Reliability Engineer
...building the category-defining AI workflow automation platform that... ...’re hiring an SRE to join our engineering team at Plenful and take ownership of the reliability and performance of the systems... ...fully implemented and measured. Infrastructure and Platform Operations:...
Suggested
Work at office
Remote work
Flexible hours
2 days per week
Plenful
San Francisco, CA
3 days ago
Senior / Staff Site Reliability, Platform Engineering
...security, delivering an AI-powered platform that... .... As a Staff Platform Engineer, you will play a... ...leadership role. You will own reliability for major platform... ...maintaining the shared infrastructure services and platforms... ...Platform Engineering, or Site Reliability...
Suggested
Saviynt
San Francisco, CA
2 days ago
Cloud-Native Site Reliability Engineer | Kubernetes & AWS
$125k - $165k
A leading innovator in laboratory software is seeking a Site Reliability Engineer in San Francisco, CA. The role focuses on ensuring reliability and performance of AI systems, managing production infrastructure, and operating resilient systems in cloud environments. The...
TELCOR
San Francisco, CA
3 days ago
Senior Manager, Site Reliability Engineering - Infrastructure Platform
$232k - $319k
...Secure Every Identity, from AI to Human Identity is the key... ...the trusted, neutral infrastructure that enables organizations to... ...service with great people and reliable, cost-effective, and efficient... ...velocity of SRE and product engineering by developing robust platforms...
Permanent employment
Local area
Worldwide
Flexible hours
Okta, Inc.
San Francisco, CA
9 hours ago
Senior Manager, Site Reliability Engineering
$227.2k - $324.5k
...About the Role: Site Reliability Engineering (SRE) at Tubi is not a traditional operations team.... ...Partner with infra lead to align Tubi's infrastructure & SRE roadmap. Partner with tech... ...for our observability and SRE related AI platforms, work with infra lead and finance...
Full time
Contract work
Temporary work
Local area
Flexible hours
Tubi
San Francisco, CA
4 days ago
Sr. Site Reliability Engineer
$163k - $203k
...SRE team, responsible for the reliability, scalability, and security... ...This is as much of a platform engineering role as it is SRE role — you... ....We are building an agentic AI-first operations model where... ...compute (managed by the Infrastructure Engineering team) across all...
Work experience placement
Work at office
Local area
Remote work
Flexible hours
2 days per week
Prosper
San Francisco, CA
24 days ago
Senior Site Reliability Engineer
$166.9k - $225.9k
...operates as both a central engineering function and an embedded reliability practice. You'll be part... ...reliability. Our infrastructure runs on AWS across multiple... ...years of experience in Site Reliability Engineering,... ...Experience with AIOps—using AI/ML‑based tooling for...
Flexible hours
Drata
San Francisco, CA
1 day ago
Hyperbolic Labs - Senior Site Reliability Engineer
Hyperbolic Labs is on a mission to democratize AI by breaking down the barriers to computing... ...redefine computing. About the Role We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace and AI infrastructure operate with exceptional reliability,...
deCircle
San Francisco, CA
4 days ago
Site Reliability Engineer
About HappyRobot HappyRobot is the infrastructure for enterprises to build and orchestrate AI workforces. Our AI workers don'... ...looking for an Infrastructure Engineer to take the lead on scaling our... ...role where you’ll shape how reliability is done - reducing incident load...
Worldwide
Shift work
Happyrobot Inc.
San Francisco, CA
5 days ago
Senior Site Reliability Engineer - Observability
...Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands... ...day is currently Tuesday. Engineering at Lambda is responsible... ...and improve product reliability. Lead members of other engineering... ...5+ years of experience in Site Reliability Engineering...
Work at office
Local area
Work from home
Lambda
San Francisco, CA
4 days ago
Site Reliability Engineer — AI Systems (Remote)
TELCOR Inc is looking for a Site Reliability Engineer to ensure the reliability, scalability, and performance of our AI products' systems. The role involves designing and operating... ...environments while managing production infrastructure and deployment workflows. The ideal...
Remote job
TELCOR Inc
San Francisco, CA
3 days ago
Site Reliability Engineer
...the next generation of Gen AI‑driven code reviewers: a symbiotic... ...outperforms individual engineers. We combine language models... ...are seeking an experienced Site Reliability Engineer to join our Platform... ...automation platforms, and owning the infrastructure that powers our AI‑driven...
CodeRabbit
San Francisco, CA
5 days ago
Senior Site Reliability Engineer - AI-Driven, Scalable Infra
OutSystems, Inc. is looking for a Site Reliability Engineer to join their team in San Francisco, CA. The ideal candidate will lead the onboarding of services and teams to reliability tenets while establishing SLOs and SLAs. Proficiency in Python and experience with Kubernetes...
Flexible hours
OutSystems, Inc.
San Francisco, CA
5 days ago
Site Reliability Engineer — Scale AI Infra with Ownership
Happyrobot Inc. is looking for an Infrastructure Engineer in San Francisco, California. This role involves leading the stability and observability... ...familiarity with monitoring tools. Join us at a high-growth AI startup backed by top investors, where you will have...
Happyrobot Inc.
San Francisco, CA
5 days ago
Sr. Site Reliability Engineer
$163k - $203k
...SRE team, responsible for the reliability, scalability, and security... ...This is as much a platform engineering role as it is an SRE role— you... ...We are building an agentic AI‑first operations model where... ...based compute (managed by the Infrastructure Engineering team) across all...
Work experience placement
Work at office
Remote work
Flexible hours
2 days per week
GoTo Meeting
San Francisco, CA
5 days ago
Site Reliability Engineer
$125k - $165k
Position Site Reliability Engineer Location Lincoln, NE, San Francisco, CA, or Remote Job ID 434... ...performance of the systems that power our AI products. This role will also design... ...environments, and manage production infrastructure and deployment workflows across...
Temporary work
Remote work
Visa sponsorship
Work visa
Flexible hours
TELCOR Inc
San Francisco, CA
3 days ago
Senior Site Reliability Engineer
...Connor was a machine learning research engineer at Scale AI. The rest of our team comes from... ...Senior SRE, you'll tackle the scaling and reliability challenges that come with adding... ...scale. What You'll Do Scale our data infrastructure: Optimize and extend our ClickHouse and...
Unify
San Francisco, CA
5 days ago
Site Reliability Engineer III
$151.5k - $252.5k
Veeam is the Data and AI Trust Company, specializing in helping organizations... ...are looking for an experienced Senior Site Reliability Engineer to join the Veeam Data Cloud (VDC)... ...stack based on containers, serverless infrastructure, Golang, public cloud services in the...
Base plus commission
Local area
Worldwide
Veeam
San Francisco, CA
1 day ago
Senior Site Reliability Engineer
...services and teams to the reliability tenets. Establish and maintain... ..., reliable, and secure infrastructure, ensuring cloud‑native... ...Program in Python, using Gen AI tooling to accelerate automation... ...6+ years of experience in Site Reliability Engineering, managing infrastructure...
OutSystems, Inc.
San Francisco, CA
5 days ago
Senior Site Reliability Engineer
$60 per hour
Senior Site Reliability Engineer (Copy) Seattle Hybrid (Hybrid location). Full-time. About Us Supio is a trusted AI platform purpose-built for law firms, reshaping how data drives impactful... ...hotfixes — while also automating infrastructure, monitoring systems, and GitHub...
Full time
Work at office
Flexible hours
Bonfirevc
San Francisco, CA
5 days ago
Senior Site Reliability Engineer
Senior Site Reliability Engineer Hybrid - San Francisco Our Mission &... ...operates as both a central engineering function and an embedded reliability... ...approach reliability. Our infrastructure runs on AWS across multiple... ...with AIOps - using AI/ML-based tooling for anomaly...
Work at office
Immediate start
Worldwide
Monday to Friday
Flexible hours
Careers at Drata
San Francisco, CA
1 day ago
Senior Site Reliability Engineer
...more information, please read our... ...Senior Site Reliability Engineer... ...software engineering and applies them to infrastructure and operations problems. The main goals... ...Programming in Python supported by Gen AI tooling to accelerate development of...
Immediate start
Remote work
Worldwide
OutSystems Inc.
San Francisco, CA
5 days ago
Senior Site Reliability Engineer, Fleet Management
$127k - $249k
The Team Platform Engineering is the department within SRE that is... ...responsible for a range of critical infrastructure and operational functions... ...that ensure cluster reliability and security (e.g., CoreDNS,... ...redefined the database for the AI era, enabling innovators to...
Work at office
Local area
Remote work
Worldwide
Flexible hours
MongoDB
San Francisco, CA
2 days ago
Site Reliability Engineer
The role We're looking for a world-class Site Reliability Engineer to ensure the reliability, performance, and scalability of our AI infrastructure platform. You’ll be building and operating the core systems that power agentic AI at scale. Your mission: keep our ultra...
Blaxel
San Francisco, CA
1 day ago
Site Reliability Engineer
...changing that, using AI to disrupt a massive market... ...the role Gamma's infrastructure needs to be rock-solid... ...users while enabling our engineering teams to ship fast.... ...tooling that improves reliability and partnering with engineering... ...ll bring 5+ years in Site Reliability...
Work at office
Work from home
gamma.app
San Francisco, CA
3 days ago
Senior Site Reliability Engineer (Upmarket)
...human. Heidi is building an AI Care Partner that works alongside... .... We’re a team of doctors, engineers, designers, researchers, and... ...-end. Improve operational reliability: Identify recurring issues... ...Kubernetes clusters, cloud infrastructure, and core platform services,...
Work at office
Worldwide
Heidi Health Ltd
San Francisco, CA
5 days ago
Site Reliability Engineer
$125k - $165k
Position: Site Reliability Engineer Location: San Francisco, CA Job Id: 434 # of Openings: 1 TELCOR Inc, a leading... ...Reliability Engineer to join our TELCOR AI Systems team! Do you have strong experience in cloud infrastructure, distributed systems and production...
Temporary work
Work at office
Visa sponsorship
Work visa
Relocation package
Flexible hours
TELCOR
San Francisco, CA
3 days ago
