Senior Site Reliability Engineer - AI Infrastructure

Andromeda

Location: Global Remote / San Francisco · Full-Time

About Andromeda

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved for hyperscalers. We began with a single managed cluster, but it filled almost instantly. Since then, we’ve been quietly building the systems, network, and orchestration layer that make the world’s AI infrastructure more accessible.

Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.

Our long-term vision is to build the liquidity layer for global AI compute: a marketplace that moves the infrastructure and workloads powering AGI, much as capital flows through the world’s financial markets. We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

The Role

This is not a generalist SRE role. You will design, operate, and debug large-scale GPU infrastructure used for distributed training and inference, working directly with customers pushing the limits of modern AI systems. We’re looking for engineers who have personally run GPU clusters in production, understand the failure modes of distributed training, and can reason about performance from network fabric → kernel → framework.

What You’ll Own

GPU Cluster Architecture: Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training. Make topology-aware scheduling, networking, and storage decisions that directly impact training throughput and cost efficiency.

Customer Technical Partnership: Serve as the primary technical point of contact for customers running large-scale training workloads. Onboard, troubleshoot, and optimize, often in real time.

Reliability & Performance Engineering: Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure (ECC errors, NVLink degradation, NCCL timeouts). Own capacity planning across heterogeneous GPU fleets optimized for training throughput.

Networking & Fabric Health: Ensure the health and performance of high-speed interconnects (InfiniBand, RoCE, NVLink) that underpin distributed training. Diagnose and resolve fabric-level issues that degrade collective operations.

Observability: Build deep visibility into GPU utilization, memory pressure, interconnect throughput, training job performance, and hardware health. Go well beyond standard infrastructure metrics.

Automation & Tooling: Build production-grade automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management.

Incident Leadership: Lead incident response for complex, multi-layer failures spanning hardware, networking, orchestration, and ML frameworks. Drive blameless postmortems and systemic fixes.

What We’re Looking For

GPU Systems Expertise: Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent). You understand GPU memory hierarchies, ECC behavior, thermal throttling, and hardware failure modes from direct experience, not documentation.

High-Performance Networking: Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training. You can diagnose why an all-reduce is slow, identify a degraded link in a fat-tree topology, and reason about congestion control at scale.

Distributed Training & ML Frameworks: Working knowledge of how large training jobs actually run: NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar. You don’t need to write the models, but you need to understand what’s happening at the systems level when a 1,000-GPU training run stalls.

Linux & Systems Internals: Expert-level Linux knowledge: kernel tuning, driver management (NVIDIA drivers, CUDA toolkit), cgroup/namespace internals, and performance profiling at the syscall and hardware level.

Kubernetes & Orchestration: Strong experience running Kubernetes in production with GPU workloads, including device plugins, topology-aware scheduling, multi-cluster federation, and custom operators. Experience with Slurm or other HPC schedulers is equally valued.

Automation & Software Engineering: Strong engineering skills in Python, Go, or Bash. You build production-grade tools and services, not just scripts. Infrastructure-as-Code proficiency (Terraform, Helm, Ansible, or equivalent).

Observability & Monitoring: Hands-on experience building monitoring and alerting for GPU infrastructure: not just Prometheus/Grafana basics, but GPU-specific telemetry (DCGM, nvidia-smi, fabric manager metrics) integrated into actionable dashboards.

Incident Management: Proven track record leading incident response for complex distributed systems where the failure could be in hardware, firmware, networking, drivers, orchestration, or application code, and you need to narrow it down fast.

Strong Candidates May Have

Distributed Storage: Experience with high-performance parallel file systems (VAST, Weka, Lustre, GPFS) and the checkpoint I/O and data-loading bottlenecks that come with large training runs.

Training Optimization: Experience profiling and optimizing distributed training performance: identifying stragglers, tuning collective communication strategies, improving MFU (Model FLOPs Utilization), and reducing idle GPU time across large runs.

Cluster Buildout & Hardware: Experience with physical cluster design: rack layout, power/cooling constraints, network topology design, and hardware validation/burn-in at scale.

Team Leadership: Experience leading or mentoring a team of infrastructure engineers. We’re growing and need people who raise the bar for everyone around them.

Why You’ll Love It Here

This is a high-impact, senior builder’s role. You’ll have significant ownership and autonomy to shape how our systems run at a foundational level, working directly with customers and providers while architecting the infrastructure backbone for reliable, scalable AI compute. You’ll influence technical direction and help define what world-class AI infrastructure operations look like.

Andromeda Cluster is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Vacancy posted 1 day ago

Similar jobs that could be interesting for you
Based on the Senior Site Reliability Engineer - AI Infrastructure in San Francisco, CA vacancy

Senior Manager, Site Reliability Engineering - Infrastructure Platform
$176k - $264k
...Secure Every Identity, from AI to Human Secure Every Identity... ...the trusted, neutral infrastructure that enables organizations to... ...service with great people and reliable, cost‑effective, and efficient... ...with architects and product engineering. Build a world‑class observability...
Senior
Permanent employment
Local area
Worldwide
Flexible hours
Okta, Inc.
San Francisco, CA
2 days ago
Senior Site Reliability Engineer, Identity Platform
$186.07k - $218.9k
...join the IT Operations Corporate Engineering team as a Senior Site Reliability Engineer focused on building and scaling... ...IAM) platform. This team owns the infrastructure that secures how every employee... ...frameworks. Utilizes generative AI responsibly, maintaining human oversight...
Senior
Local area
Coinbase
San Francisco, CA
1 day ago
Senior Site Reliability Engineer - AI Infra & Observability
A tech company specializing in AI is seeking a Site Reliability Engineer to ensure the reliability and observability of their production services. You will instrument services, develop SRE standards, and manage incident response. The ideal candidate has experience in AWS...
Senior
Remote work
Flexible hours
You.com
San Francisco, CA
1 day ago
Senior Site Reliability Engineer
...Type Full time Department Engineering Who We Are Hyperbolic Labs... ...on a mission to democratize AI by breaking down the barriers... ...the Role We're seeking a Site Reliability Engineer to ensure Hyperbolic... ...'s GPU marketplace and AI infrastructure operate with exceptional...
Senior
Full time
Hyperbolic
San Francisco, CA
1 day ago
Senior Site Reliability Engineer
$175k - $250k
...250,000.00/yr Job Title: Senior Cloud Infrastructure Engineer Location: San Francisco, CA... ...unavailable. Modality: On-Site only. Must live within... ...interact with generative AI. They are the team behind... ...scalability, performance, and reliability across environments. What...
Senior
Full time
Remote work
Relocation
Relocation package
The Recruiting Guy
San Francisco, CA
2 days ago
Senior Software Engineer, Site Reliability Engineer
$200k - $260k
Senior Software Engineer, Site Reliability Engineer (SRE) Why Harvey At Harvey, we’re transforming how legal and... .... By combining frontier agentic AI, an enterprise‑grade platform, and... ...that sits at the intersection of infrastructure and product, owning the systems that...
Senior
Relocation package
Harvey
San Francisco, CA
1 day ago
Senior Site Reliability Engineer
...About the Role SDF is looking for a Senior Site Reliability Engineer to help build and operate the foundation... ...our systems, design and improve the infrastructure behind our production environments,... ...code Experience experimenting with AI‑driven approaches to operations...
Senior
TechChain Talent
San Francisco, CA
1 day ago
Senior Site Reliability Engineer
$170k - $290k
About Luma AI Luma’s mission is to build multimodal AI to expand human imagination... .... This requires a massive, reliable, and performant GPU infrastructure that pushes the boundaries of scale... ...for a hands-on, first-principles engineer who is fluent in Linux, comfortable...
Senior
Work experience placement
Luma | Dream Lab
San Francisco, CA
1 day ago
Senior Site Reliability Engineer - SDN
$240k - $312k
...Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands... ...day is currently Tuesday. Engineering at Lambda is responsible... ...toil and improve reliability Collaborate with software,... ...5+ years of experience in Site Reliability Engineering, Production...
Senior
Work at office
Local area
Work from home
Flexible hours
Lambda
San Francisco, CA
1 day ago
Senior Site Reliability Engineer
...proactively and reactively improve the reliability of Block's platform and critical infrastructure. You are metrics-driven, systems-... ...and continuously improve AI-driven tooling and automation to... ...desire to perform and grow as an engineer 5+ years of software development...
Senior
Flexible hours
Block, Inc.
San Francisco, CA
1 day ago
Senior Site Reliability Engineer
...every organization be more reliable. We do this by building an industry... ...manual toil, improve engineering velocity and developer experience... .... Architect and scale our infrastructure, ensuring best‑in‑class... ...Unlimited token usage and access to AI tools A fast‑moving, high‑...
Senior
Home office
Rootly
San Francisco, CA
1 day ago
Senior Site Reliability Engineer
...the next generation of Gen AI-driven code reviewers: a symbiotic... ...outperforms individual engineers. We combine language models... ...are seeking an experienced Site Reliability Engineer to join our Platform... ...automation platforms, and owning the infrastructure that powers our AI-driven...
Senior
CodeRabbit
San Francisco, CA
1 day ago
Senior Site Reliability Engineer, Fleet Management
$127k - $249k
The Team Platform Engineering is the department within SRE that is... ...responsible for a range of critical infrastructure and operational functions... ...that ensure cluster reliability and security (e.g., CoreDNS,... ...redefined the database for the AI era, enabling innovators to...
Senior
Work at office
Local area
Remote work
Worldwide
Flexible hours
MongoDB
San Francisco, CA
2 days ago
Senior Site Reliability Engineer
$195k - $240k
...You.com, we are building the AI Search Infrastructure that powers modern AI systems... ...infrastructure to make AI systems more reliable, transparent, and useful. Our team includes engineers, researchers, product... .... About the Role As a Site Reliability Engineer, you will...
Senior
Full time
Immediate start
Remote work
Work from home
Flexible hours
You.com
San Francisco, CA
1 day ago
Senior/Staff Site Reliability Engineer
$325k
Engineering at Ivo: Engineers at Ivo are inventors. Ivo was first-to-market with an AI agent that lives in MS Word and edits the... ...this) [2025] The Role Infrastructure Engineers build the... ...We’re looking for a Senior or Staff level Site Reliability Engineer as part of the...
Senior
Contract work
Ivo
San Francisco, CA
1 day ago
Senior/Staff Site Reliability Engineer
Role Infrastructure Engineers build the foundation for Ivo’s entire platform. Own and shape the future of our environment. We still have a relatively... ...back, and at Ivo, that’s our mission. We’re building an AI-native platform to automate legal drudgery. People love our...
Senior
Contract work
Ivo
San Francisco, CA
1 day ago
Senior Manager, Site Reliability Engineering
$227.2k - $324.5k
About the Role: Site Reliability Engineering (SRE) at Tubi is not a traditional operations... ..., and guide the careers of senior and emerging talent within... ...infra lead to align Tubi’s infrastructure & SRE roadmap. Partner with... ...and SRE related AI platforms, work with infra...
Senior
Full time
Contract work
Temporary work
Local area
Flexible hours
Tubitv
San Francisco, CA
1 day ago
Senior / Staff Site Reliability, Platform Engineering
...identity security, delivering an AI-powered platform that... ...systems. As a Staff Platform Engineer, you will play a critical role... ...role. You will own reliability for major platform domains,... ...and maintaining the shared infrastructure services and platforms that...
Senior
Saviynt
San Francisco, CA
28 days ago
Senior Site Reliability Engineer
$148.5k - $223.9k
...Job Category Software Engineering Job Details About Salesforce... ...Salesforce is the #1 AI CRM, where humans with... ...is seeking a senior engineering candidate to join the Site Reliability organization in San Francisco... ...counterparts in the Infrastructure and R&D organizations,...
Senior
Full time
Worldwide
Weekend work
Salesforce
San Francisco, CA
8 hours ago
Senior Lead Infrastructure Engineer
...direct and meaningful impact. As a Senior Lead Infrastructure Engineer at JPMorgan Chase within the... ...mitigate risk Uses enterprise-authorized AI capabilities within the work environment... ...health care coverage, on-site health and wellness centers, a retirement...
Senior
J.P. Morgan
San Francisco, CA
4 days ago
Site Reliability Engineer, Inference Infrastructure
...enterprises who are building AI systems to power magical... ...-performance, scalable and reliable machine learning systems? Do... ...applications? We are looking for a Site Reliability Engineer to join the Model Serving... ...and influence the Infrastructure team’s roadmap based on their...
Full time
Work experience placement
Work at office
Remote work
Flexible hours
Cohere
San Francisco, CA
1 day ago
Senior Site Reliability Engineer — Cloud, IaC & Observability
Cooley LLP in San Francisco is looking for a Senior Technology Site Reliability Engineer to ensure the reliability and performance of critical infrastructure. The role blends software and systems engineering for high availability solutions. The ideal candidate has over...
Senior
Cooley LLP
San Francisco, CA
1 day ago
Senior Network & Site Reliability Engineer
$210k - $240k
About The Role We’re building infrastructure that has to perform under real‑world scale, reliability, and security demands - and we’re looking for an engineer who wants to own the foundation it runs on. This isn’t a traditional "keep the lights on" role. You’ll design...
Senior
Alembic Technologies
San Francisco, CA
1 day ago
Senior Site Reliability Engineer: Cloud & Kubernetes Lead
A leading technology firm in San Francisco is seeking a Senior Site Reliability Engineer to maintain and improve cloud infrastructure. The ideal candidate has over 5 years of experience as an SRE or DevOps engineer and strong expertise in Kubernetes. This role focuses on...
Senior
TechChain Talent
San Francisco, CA
1 day ago
Senior Site Reliability Engineer (SRE) - AI Infrastructure
$300k
...stealth-mode startup building out their AI and cloud platform, powered by... ...training, or inference. As a Platform Engineer/Senior Site Reliability Engineer, you’ll own the reliability... ...and automation of this GPU-powered infrastructure, ensuring seamless orchestration across...
Senior
Permanent employment
San Francisco, CA
more than 2 months ago
Site Reliability Engineer (Senior or Staff), Infrastructure Security
$127k - $249k
Senior / Staff Engineer - SRE, InfraSec We are looking for an experienced Senior or Staff Engineer for our SRE, InfraSec team to guide the security of our cloud‑based infrastructure. You will be highly hands‑on technically while also mentoring a small team of SREs. The...
Senior
Local area
Remote work
The Consulting Solutions
San Francisco, CA
3 days ago
Senior Site Reliability Engineer, Infrastructure Foundations
$15 per hour
Senior Site Reliability Engineer, Infrastructure Foundations · Wikimedia Foundation · AI Ethics & Tech for Good · Education Access & Learning Equity · Operations, Finance & HR · Location: Remote · Work Mode: Remote · Found: 7 days ago · Experience: Senior. For the full description, please visit the...
Senior
Permanent employment
For contractors
Currently hiring
Local area
Remote work
Social Impact Guide
San Francisco, CA
1 day ago
Senior Principal Engineer, Enterprise Platform
Lila Sciences in San Francisco is seeking a Principal Engineer for their Enterprise Platform. You will design core identity... ...and architect platform systems to support enterprise-level AI-driven workflows. This senior technical contributor role demands deep expertise in SaaS...
Senior
Lila Sciences
San Francisco, CA
1 day ago
Senior Site Reliability Engineer (GPU Clusters) - Hosting
$250k
...of growth opportunities? Join a rapidly scaling AI cloud infrastructure provider building a next-generation GPU platform designed... ...the United States. The company is looking for a Senior / Staff Site Reliability Engineer to support and scale large-scale HPC and cloud...
Senior
Permanent employment
Remote work
San Francisco, CA
a month ago
Site Reliability Engineer, AI Platform
Thinking Machines Lab in San Francisco is seeking a Site Reliability Engineer to enhance the reliability of their Tinker platform. You will collaborate with engineers and research teams to ensure robust systems and prompt incident responses. The ideal candidate holds a...
Visa sponsorship
Thinking Machines Lab
San Francisco, CA
1 day ago
