Senior Site Reliability Engineer (GPU Clusters) - Hosting

$250k
Full-time

Looking for a role with plenty of growth opportunities?

Join a rapidly scaling AI cloud infrastructure provider building a next-generation GPU platform designed for AI training, experimentation, and inference at scale. The company is developing a fully featured AI cloud platform powered by renewable energy and is already operating with strong momentum across Europe, while now significantly expanding its footprint in the United States.

The company is looking for a Senior / Staff Site Reliability Engineer to support and expand large-scale HPC and cloud environments powering GPU-intensive workloads. The role involves working closely with platform, ML, and infrastructure teams to improve reliability, automation, and observability across distributed compute environments while supporting long-term infrastructure growth and scalability.

Don’t miss out on this exciting opportunity. Apply today!

Responsibilities:

  • Ensure the reliability, scalability, and performance of HPC and cloud infrastructure environments
  • Design, build, and maintain automation, observability, and monitoring frameworks for GPU compute clusters
  • Collaborate with ML, data, and platform engineering teams to deliver highly available infrastructure systems
  • Improve CI/CD pipelines, deployment workflows, and operational tooling
  • Contribute to infrastructure architecture discussions and long-term platform strategy
  • Diagnose performance bottlenecks across distributed systems and HPC workloads
  • Support and optimize Slurm-based GPU cluster environments
  • Participate in an on-call rotation supporting mission-critical infrastructure operations

Skills/Must Have:

  • Deep experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or related fields
  • Strong experience supporting HPC or large-scale distributed compute environments
  • Deep Linux expertise (Ubuntu/Debian preferred)
  • Strong scripting and automation skills using Python, Go, or Bash
  • Hands-on experience with public cloud platforms or modern GPU cloud providers
  • Strong understanding of networking fundamentals (DNS, TCP/IP, routing, performance optimization)
  • Experience with Infrastructure-as-Code tooling such as Terraform and Ansible
  • Proven experience operating Slurm-based GPU/HPC clusters
  • Ability to troubleshoot distributed systems and optimize workload scheduling/performance

Benefits:

  • Stock options
  • Bonus 
  • Remote working option and allowance 

Salary:

  • Circa $250,000 base salary

Vacancy posted 5 days ago
