Senior Site Reliability Engineer (GPU Clusters) - Hosting

$250k
Full-time

Looking for a role with plenty of growth opportunities?

Join a rapidly scaling AI cloud infrastructure provider building a next-generation GPU platform designed for AI training, experimentation, and inference at scale. The company is developing a fully featured AI cloud platform powered by renewable energy and is already operating with strong momentum across Europe, while now significantly expanding its footprint in the United States.

The company is looking for a Senior / Staff Site Reliability Engineer to support and scale the large HPC and cloud environments that power GPU-intensive workloads. The role involves working closely with platform, ML, and infrastructure teams to improve reliability, automation, and observability across distributed compute environments while supporting long-term infrastructure growth and scalability.

Don’t miss out on this exciting opportunity. Apply today!

Responsibilities:

  • Ensure the reliability, scalability, and performance of HPC and cloud infrastructure environments
  • Design, build, and maintain automation, observability, and monitoring frameworks for GPU compute clusters
  • Collaborate with ML, data, and platform engineering teams to deliver highly available infrastructure systems
  • Improve CI/CD pipelines, deployment workflows, and operational tooling
  • Contribute to infrastructure architecture discussions and long-term platform strategy
  • Diagnose performance bottlenecks across distributed systems and HPC workloads
  • Support and optimize Slurm-based GPU cluster environments
  • Participate in an on-call rotation supporting mission-critical infrastructure operations

Skills/Must Have:

  • Deep experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or related fields
  • Strong experience supporting HPC or large-scale distributed compute environments
  • Deep Linux expertise (Ubuntu/Debian preferred)
  • Strong scripting and automation skills using Python, Go, or Bash
  • Hands-on experience with public cloud platforms or modern GPU cloud providers
  • Strong understanding of networking fundamentals (DNS, TCP/IP, routing, performance optimization)
  • Experience with Infrastructure-as-Code tooling such as Terraform and Ansible
  • Proven experience operating Slurm-based GPU/HPC clusters
  • Ability to troubleshoot distributed systems and optimize workload scheduling/performance

Benefits:

  • Stock options
  • Bonus 
  • Remote working option and allowance 

Salary:

  • Circa $250,000 base salary

Vacancy posted 27 days ago
