
Senior Site Reliability Engineer AI Infrastructure

Cortes 23

Location: Global Remote / San Francisco • Full-Time

About Andromeda

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers. We began with a single managed cluster – but it filled almost instantly. Since then, we have been quietly building the systems, network, and orchestration layer that makes the world’s AI infrastructure more accessible.

Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.

Our long-term vision is to build the liquidity layer for global AI compute – a marketplace that moves the infrastructure and workloads powering AGI, much as the world’s financial markets move capital. We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

The Role

This is not a generalist SRE role. You will design, operate, and debug large-scale GPU infrastructure used for distributed training and inference, working directly with customers pushing the limits of modern AI systems.

We’re looking for engineers who have personally run GPU clusters in production, understand the failure modes of distributed training, and can reason about performance across the stack, from network fabric to kernel to framework.

What You’ll Own

  • GPU Cluster Architecture: Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training. Make topology-aware scheduling, networking, and storage decisions that directly impact training throughput and cost efficiency.
  • Customer Technical Partnership: Serve as the primary technical point of contact for customers running large-scale training workloads. Onboard, troubleshoot, and optimize, often in real time.
  • Reliability & Performance Engineering: Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure (ECC errors, NVLink degradation, NCCL timeouts). Own capacity planning across heterogeneous GPU fleets optimized for training throughput.
  • Networking & Fabric Health: Ensure the health and performance of high-speed interconnects (InfiniBand, RoCE, NVLink) that underpin distributed training. Diagnose and resolve fabric-level issues that degrade collective operations.
  • Observability: Build deep visibility into GPU utilization, memory pressure, interconnect throughput, training job performance, and hardware health. Go well beyond standard infrastructure metrics.
  • Automation & Tooling: Build production-grade automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management.
  • Incident Leadership: Lead incident response for complex, multi-layer failures spanning hardware, networking, orchestration, and ML frameworks. Drive blameless post-mortems and systemic fixes.

What We’re Looking For

  • GPU Systems Expertise: Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent). You understand GPU memory hierarchies, ECC behavior, thermal throttling, and hardware failure modes from direct experience, not documentation.
  • High-Performance Networking: Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training. You can diagnose why an all-reduce is slow, identify a degraded link in a fat-tree topology, and reason about congestion control at scale.
  • Distributed Training & ML Frameworks: Working knowledge of how large training jobs actually run – NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar. You don’t need to write the models, but you need to understand what’s happening at the systems level when a 1,000-GPU training run stalls.
  • Linux & Systems Internals: Expert-level Linux knowledge: kernel tuning, driver management (NVIDIA drivers, CUDA toolkit), cgroup/namespace internals, and performance profiling at the syscall and hardware level.
  • Kubernetes & Orchestration: Strong experience running Kubernetes in production with GPU workloads, including device plugins, topology-aware scheduling, multi-cluster federation, and custom operators. Experience with Slurm or other HPC schedulers is equally valued.
  • Automation & Software Engineering: Strong engineering skills in Python, Go, or Bash. You build production-grade tools and services, not just scripts. Infrastructure-as-Code proficiency (Terraform, Helm, Ansible, or equivalent).
  • Observability & Monitoring: Hands-on experience building monitoring and alerting for GPU infrastructure: not just Prometheus/Grafana basics, but GPU-specific telemetry (DCGM, nvidia-smi, fabric manager metrics) integrated into actionable dashboards.
  • Incident Management: Proven track record leading incident response for complex distributed systems where the failure could be in hardware, firmware, networking, drivers, orchestration, or application code, and you need to narrow it down fast.

Strong Candidates May Have

  • Distributed Storage: Experience with high-performance parallel file systems (VAST, Weka, Lustre, GPFS) and the checkpoint I/O and data-loading bottlenecks that come with large training runs.
  • Training Optimization: Experience profiling and optimizing distributed training performance: identifying stragglers, tuning collective communication strategies, improving MFU (Model FLOPs Utilization), and reducing idle GPU time across large runs.
  • Cluster Buildout & Hardware: Experience with physical cluster design – rack layout, power/cooling constraints, network topology design, and hardware validation/burn-in at scale.
  • Team Leadership: Experience leading or mentoring a team of infrastructure engineers. We’re growing and need people who raise the bar for everyone around them.

Why You’ll Love It Here

This is a high-impact, senior builder’s role. You’ll have significant ownership and autonomy to shape how our systems run at a foundational level, working directly with customers and providers while architecting the infrastructure backbone for reliable, scalable AI compute. You’ll influence technical direction and help define what world-class AI infrastructure operations look like.

Andromeda Cluster is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Vacancy posted 4 days ago
