Site Reliability Engineer - AI Infrastructure
$250k
Hamilton Barnes Associates Limited
Are you looking for an exciting new opportunity? Join a seed-stage AI infrastructure company building large-scale training and inference platforms previously accessible only to hyperscalers. The business began with a single managed GPU cluster that reached capacity almost immediately and has since expanded into a global platform spanning infrastructure, networking, and orchestration.

Responsibilities
- Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
- Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
- Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
- Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
- Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
- Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have
- 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
- Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
- Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
- Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
- Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
- Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
- Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits
- IPO equity
- 10% company bonus
- 401(k) with 4% match

Salary
$250,000 gross per year


