Senior Site Reliability Engineer (GPU Clusters) - Hosting

$250k

Full-time

Looking for a role with plenty of growth opportunities?

Join a rapidly scaling AI cloud infrastructure provider building a next-generation GPU platform designed for AI training, experimentation, and inference at scale. The company is developing a fully featured AI cloud platform powered by renewable energy and is already operating with strong momentum across Europe, while now significantly expanding its footprint in the United States.

The company is looking for a Senior / Staff Site Reliability Engineer to support and scale large HPC and cloud environments powering GPU-intensive workloads. The role involves working closely with platform, ML, and infrastructure teams to improve reliability, automation, and observability across distributed compute environments while supporting long-term infrastructure growth.

Don’t miss out on this exciting opportunity and apply today!

Responsibilities:

Ensure the reliability, scalability, and performance of HPC and cloud infrastructure environments
Design, build, and maintain automation, observability, and monitoring frameworks for GPU compute clusters
Collaborate with ML, data, and platform engineering teams to deliver highly available infrastructure systems
Improve CI/CD pipelines, deployment workflows, and operational tooling
Contribute to infrastructure architecture discussions and long-term platform strategy
Diagnose performance bottlenecks across distributed systems and HPC workloads
Support and optimize Slurm-based GPU cluster environments
Participate in an on-call rotation supporting mission-critical infrastructure operations

Skills/Must Have:

Deep experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or related fields
Strong experience supporting HPC or large-scale distributed compute environments
Deep Linux expertise (Ubuntu/Debian preferred)
Strong scripting and automation skills using Python, Go, or Bash
Hands-on experience with public cloud platforms or modern GPU cloud providers
Strong understanding of networking fundamentals (DNS, TCP/IP, routing, performance optimization)
Experience with Infrastructure-as-Code tooling such as Terraform and Ansible
Proven experience operating Slurm-based GPU/HPC clusters
Ability to troubleshoot distributed systems and optimize workload scheduling/performance

Benefits:

Stock options
Bonus
Remote working option and allowance

Salary:

Circa $250,000 base salary

Vacancy posted a month ago
