Senior Site Reliability Engineer (GPU Clusters) - Hosting
$250k
Looking for a role with plenty of growth opportunities?
Join a rapidly scaling AI cloud infrastructure provider building a next-generation GPU platform designed for AI training, experimentation, and inference at scale. The company is developing a fully featured AI cloud platform powered by renewable energy and is already operating with strong momentum across Europe, while now significantly expanding its footprint in the United States.
The company is looking for a Senior / Staff Site Reliability Engineer to support and scale the large HPC and cloud environments that power GPU-intensive workloads. The role involves working closely with platform, ML, and infrastructure teams to improve reliability, automation, and observability across distributed compute environments while supporting long-term infrastructure growth and scalability.
Don’t miss out on this exciting opportunity: apply today!
Responsibilities:
- Ensure the reliability, scalability, and performance of HPC and cloud infrastructure environments
- Design, build, and maintain automation, observability, and monitoring frameworks for GPU compute clusters
- Collaborate with ML, data, and platform engineering teams to deliver highly available infrastructure systems
- Improve CI/CD pipelines, deployment workflows, and operational tooling
- Contribute to infrastructure architecture discussions and long-term platform strategy
- Diagnose performance bottlenecks across distributed systems and HPC workloads
- Support and optimize Slurm-based GPU cluster environments
- Participate in an on-call rotation supporting mission-critical infrastructure operations
Skills/Must Have:
- Deep experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or related fields
- Strong experience supporting HPC or large-scale distributed compute environments
- Deep Linux expertise (Ubuntu/Debian preferred)
- Strong scripting and automation skills using Python, Go, or Bash
- Hands-on experience with public cloud platforms or modern GPU cloud providers
- Strong understanding of networking fundamentals (DNS, TCP/IP, routing, performance optimization)
- Experience with Infrastructure-as-Code tooling such as Terraform and Ansible
- Proven experience operating Slurm-based GPU/HPC clusters
- Ability to troubleshoot distributed systems and optimize workload scheduling/performance
Benefits:
- Stock options
- Bonus
- Remote working option and allowance
Salary:
- Circa $250,000 base salary

