Site Reliability Engineer - AI Infrastructure
$250k
Hamilton Barnes Associates Limited
Are you looking for an exciting new opportunity? Join a seed-stage AI infrastructure company building large-scale training and inference platforms previously accessible only to hyperscalers. The business began with a single managed GPU cluster that reached capacity almost immediately and has since expanded into a global platform spanning infrastructure, networking, and orchestration.

Responsibilities
- Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
- Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
- Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
- Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
- Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
- Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have
- 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
- Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
- Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
- Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
- Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
- Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
- Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits
- IPO equity
- 10% company bonus
- 401K with 4% match
- Salary: $250,000 gross per year