Senior Site Reliability Engineer (SRE) - AI Infrastructure
$300k
Hamilton Barnes Associates Limited
Join a stealth-mode startup building out its AI and cloud platform, powered by thousands of H100s, H200s, and B200s and ready for experimentation, full-scale model training, or inference. As a Platform Engineer/Senior Site Reliability Engineer, you’ll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access. You’ll also support the company's new products as they come to market.

This is a rare opportunity to work at the intersection of AI infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it. Get in touch and apply today!

Responsibilities:
- Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
- Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
- Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
- Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
- Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
- Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have:
- 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
- Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
- Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
- Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
- Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
- Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
- Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Salary & Benefits:
- $300,000 gross per year
- Equity
- ...Site Reliability Engineer (SRE) FLUIX is building the AI operating system that plans, designs, and optimizes AI infrastructure. We are based in Silicon Valley. We specialize in providing AI-driven solutions for data centers and power providers, leveraging cutting-edge... Suggested | Work at office | Weekend work
$170k - $230k
...Site Reliability Engineer (SRE) Palo Alto / San Francisco Bay Area About Mithril Mithril is an AI infrastructure platform built to make GPU compute more accessible and affordable for the world's leading enterprises, AI startups, and the AI research community,... Suggested | Work at office | Local area | 1 day per week
$350k
...Site Reliability Engineer (SRE) San Francisco Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence... ...everyone has access to the knowledge and tools to make AI work for their unique needs and goals. We are scientists... Suggested | Local area | Visa sponsorship | Work visa | Relocation package
- ...family-founded company on a mission to create the world's first AI-powered Personal & Entrepreneurial Resource Planner (PRP),... ...-and change lives along the way. The Role As a Site Reliability Engineer (SRE) at Air Apps, you will be responsible for ensuring the... Suggested | Temporary work | Worldwide
- ...About the job Senior Site Reliability Engineer About the Company Stellar is a decentralized, public blockchain... ...cloud-based systems operations, as a SRE or DevOps engineer. ~ First-hand... ...code Experience experimenting with AI-driven approaches to operations... Senior
- ...databases to data warehouses, lakes, and AI applications. With tens of thousands of... ...Role You'll be the infrastructure and reliability engineer on the Data Replication team - a full-... ...in infrastructure, platform engineering, SRE, or DevOps. ~ Hands-on ownership of... Senior | Local area
- ...create the next generation of Gen AI-driven code reviewers: a... ...significantly outperforms individual engineers. We combine language models... ...are seeking an experienced Site Reliability Engineer to join our Platform... ...services reliably. As an SRE at CodeRabbit, you'll be... Senior
$195k - $240k
...Senior Site Reliability Engineer San Francisco (Hybrid) At You.com, we are building the AI Search Infrastructure that powers modern AI systems. Our goal is to create the trusted... ...is measurable. Develop and maintain SRE standards and patterns (instrumentation guidelines... Senior | Full time | Immediate start | Remote work | Work from home | Flexible hours
$127k - $249k
...The Team Platform Engineering is the department within SRE that is responsible for a range of critical infrastructure... ...components that ensure cluster reliability and security (e.g., CoreDNS, cert-... ...the data platform for the AI era, enabling builders to create, transform... Senior | Work at office | Local area | Remote work | Worldwide | Flexible hours
$166.9k - $225.9k
...Job Summary: Drata's SRE team operates as both a central engineering function and an embedded reliability practice. You'll be part of a... ...~6+ years of experience in Site Reliability Engineering, Cloud... ...Experience with AIOps - using AI/ML-based tooling for anomaly... Senior | Work at office | Immediate start | Worldwide | Monday to Friday | Flexible hours
$220k - $235k
...Staff/Senior Staff Site Reliability Engineer Ironclad is the leading AI contracting platform that transforms agreements into assets. Contracts move faster, insights... ...seeking a strategic, high-output Staff/Senior Staff SRE to define the future of our cloud platform and... Senior | Full time | Contract work | Work at office
- We are seeking a Sr. Site Reliability Engineer to join our team and run critical infrastructure for our blockchain and web applications. You’ll learn... ...tools to streamline development processes. DevOps Engineer/SRE Transitioning to Blockchain An experienced DevOps Engineer... Senior | Remote job
- ...deeply human. Heidi is building an AI Care Partner that works... ...possible. We’re a team of doctors, engineers, designers, researchers, and... ...-to-end. Improve operational reliability: Identify recurring issues... ...re looking for 3-6+ years in SRE, DevOps, Platform, or operations... Senior | Work at office | Worldwide
- ...acquisition, and Connor was a machine learning research engineer at Scale AI. The rest of our team comes from companies like... ...go-to-market with state-of-the-art AI. As a Senior SRE, you'll tackle the scaling and reliability challenges that come with adding terabytes of... Senior
$181k - $263k
...privacy requirements. The Global SRE team is responsible for owning and supporting... ...support. We are looking for a Senior Staff Site Reliability Engineer who will set the technical direction... ...organization Familiarity with LLMs and AI-assisted development workflows,... Senior | Work from home | Flexible hours | Night shift
- ...information, please read our... Senior Site Reliability Engineer locations:... ...Function Site Reliability Engineering (SRE) is a discipline that incorporates... ...in Python supported by Gen AI tooling to accelerate development of... Senior | Immediate start | Remote work | Worldwide
$166.9k - $225.9k
Job Summary Drata's SRE team operates as both a central engineering function and an embedded reliability practice. You'll be part of a close-knit... ...6+ years of experience in Site Reliability Engineering, Cloud... ...Experience with AIOps - using AI/ML‑based tooling for anomaly... Senior | Flexible hours
- CloudDevs: Senior Site Reliability Engineer (SRE) CloudDevs works with fast-moving, venture-backed startups throughout the US. We’re building a pool of world-class Site Reliability Engineers for current roles and for upcoming opportunities. You’ll either be placed... Senior
$163k - $203k
GoTo Meeting is looking for a Senior Site Reliability Engineer in San Francisco. You will be responsible for the reliability, scalability, and security... ...candidate will mentor junior engineers and implement AI-driven operations. Benefits include a hybrid work model, competitive... Senior
- An innovative R&D company in San Francisco is seeking a Site Reliability Engineer to join its Platform Engineering team. This position focuses on ensuring the reliability and performance of an AI-powered code review platform. The ideal candidate will have 6-8 years of experience... Senior
- Senior Site Reliability Engineer - AI Infrastructure Location: Global Remote / San Francisco • Full-Time About Andromeda Andromeda Cluster was founded by... ...and engineering. The Role This is not a generalist SRE role. You will design, operate, and debug large‑scale GPU... Senior | Full time | Remote work
$232k - $319k
...Secure Every Identity, from AI to Human Identity is the key to... ...service with great people and reliable, cost-effective, and efficient... ...and various initiatives across SRE & Infrastructure organization.... ...velocity of SRE and product engineering by developing robust platforms... Senior | Permanent employment | Local area | Worldwide | Flexible hours
$202.8k - $327.63k
...management (CLM). What you’ll do The Senior Director, SRE Platform Engineering is a senior engineering leader... ...IT Service Management (ITSM) and Site Reliability Engineering (SRE) capabilities, applying... ...lead teams that deliver secure, AI‑driven, and intuitive experiences... Senior | Permanent employment | Contract work | Work at office | Local area | Remote work | 2 days per week
- Senior Infrastructure Engineer - Bland As a Senior Infrastructure Engineer at Bland,... ...processing with strict latency and reliability requirements; building and... ...industries. Lead - AI/ML Stack Infrastructure Lead... ...global deployments. Work with Site Reliability Engineering to... Senior | Temporary work
- ...We Are Hyperbolic Labs is on a mission to democratize AI by breaking down the barriers to computing power with our Open... ...redefine computing. About the Role We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace and AI infrastructure... Senior
$160k - $250k
...machine learning models, we also need to grow our DevOps and Site Reliability team to maintain the reliability of our enterprise SaaS offering... ...individuals who are passionate about creating a revolutionary AI company. At Hive, you will have a steep learning curve and an... Senior
$200k - $250k
...platform that combines modern web tooling with AI-powered workflows. Our stack includes React/... ...based production infrastructure. We’re hiring a senior owner of stability and infrastructure to ensure the platform is reliable, fast, and resilient as we scale. Role Mission... Senior | Permanent employment
- ...Udaip Cloud-Based Data And Ai Platform Engineer At U.S. Bank, we're on a journey to do our best. Helping the customers and businesses we serve to make better and smarter financial decisions and enabling the communities we support to grow and succeed. We believe it... Senior | Temporary work | Work experience placement
$181.69k - $213.75k
...Senior Site Reliability Engineer San Francisco, California; Santa Clara, California; Seattle, WA The Company You'll Join Carta connects founders... ...of RESTful and/or GraphQL API design principles. AI Fluency: You use AI tools in your own day-to-day work in... Senior | Full time | Work at office
- A leading language learning platform is seeking an experienced SRE Engineer to ensure the reliability and resilience of their infrastructure. Responsibilities include leading incident response, improving observability, and collaborating with various teams to enhance platform... Senior

