Senior Site Reliability Engineer - AI Infrastructure

Andromeda

Location: Global Remote / San Francisco · Full-Time

About Andromeda

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved for hyperscalers. We began with a single managed cluster, and it filled almost instantly. Since then, we’ve been quietly building the systems, network, and orchestration layer that make the world’s AI infrastructure more accessible.

Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.

Our long-term vision is to build the liquidity layer for global AI compute: a marketplace that moves the infrastructure and workloads powering AGI, much as capital flows through the world’s financial markets. We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

The Role

This is not a generalist SRE role. You will design, operate, and debug large-scale GPU infrastructure used for distributed training and inference, working directly with customers pushing the limits of modern AI systems. We’re looking for engineers who have personally run GPU clusters in production, understand the failure modes of distributed training, and can reason about performance from network fabric → kernel → framework.

What You’ll Own

  • GPU Cluster Architecture: Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training. Make topology-aware scheduling, networking, and storage decisions that directly impact training throughput and cost efficiency.
  • Customer Technical Partnership: Serve as the primary technical point of contact for customers running large-scale training workloads. Onboard, troubleshoot, and optimize, often in real time.
  • Reliability & Performance Engineering: Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure (ECC errors, NVLink degradation, NCCL timeouts). Own capacity planning across heterogeneous GPU fleets optimized for training throughput.
  • Networking & Fabric Health: Ensure the health and performance of high-speed interconnects (InfiniBand, RoCE, NVLink) that underpin distributed training. Diagnose and resolve fabric-level issues that degrade collective operations.
  • Observability: Build deep visibility into GPU utilization, memory pressure, interconnect throughput, training job performance, and hardware health. Go well beyond standard infrastructure metrics.
  • Automation & Tooling: Build production-grade automation for cluster provisioning, GPU health checks, job scheduling, self-healing, and firmware/driver lifecycle management.
  • Incident Leadership: Lead incident response for complex, multi-layer failures spanning hardware, networking, orchestration, and ML frameworks. Drive blameless postmortems and systemic fixes.

What We’re Looking For

  • GPU Systems Expertise: Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent). You understand GPU memory hierarchies, ECC behavior, thermal throttling, and hardware failure modes from direct experience, not documentation.
  • High-Performance Networking: Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training. You can diagnose why an all-reduce is slow, identify a degraded link in a fat-tree topology, and reason about congestion control at scale.
  • Distributed Training & ML Frameworks: Working knowledge of how large training jobs actually run: NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar. You don’t need to write the models, but you need to understand what’s happening at the systems level when a 1,000-GPU training run stalls.
  • Linux & Systems Internals: Expert-level Linux knowledge: kernel tuning, driver management (NVIDIA drivers, CUDA toolkit), cgroup/namespace internals, and performance profiling at the syscall and hardware level.
  • Kubernetes & Orchestration: Strong experience running Kubernetes in production with GPU workloads, including device plugins, topology-aware scheduling, multi-cluster federation, and custom operators. Experience with Slurm or other HPC schedulers is equally valued.
  • Automation & Software Engineering: Strong engineering skills in Python, Go, or Bash. You build production-grade tools and services, not just scripts. Infrastructure-as-Code proficiency (Terraform, Helm, Ansible, or equivalent).
  • Observability & Monitoring: Hands-on experience building monitoring and alerting for GPU infrastructure: not just Prometheus/Grafana basics, but GPU-specific telemetry (DCGM, nvidia-smi, fabric manager metrics) integrated into actionable dashboards.
  • Incident Management: Proven track record leading incident response for complex distributed systems where the failure could be in hardware, firmware, networking, drivers, orchestration, or application code, and you need to narrow it down fast.

Strong Candidates May Have

  • Distributed Storage: Experience with high-performance parallel file systems (VAST, Weka, Lustre, GPFS) and the checkpoint I/O and data-loading bottlenecks that come with large training runs.
  • Training Optimization: Experience profiling and optimizing distributed training performance: identifying stragglers, tuning collective communication strategies, improving MFU (Model FLOPs Utilization), and reducing idle GPU time across large runs.
  • Cluster Buildout & Hardware: Experience with physical cluster design: rack layout, power/cooling constraints, network topology design, and hardware validation/burn-in at scale.
  • Team Leadership: Experience leading or mentoring a team of infrastructure engineers. We’re growing and need people who raise the bar for everyone around them.

Why You’ll Love It Here

This is a high-impact, senior builder’s role. You’ll have significant ownership and autonomy to shape how our systems run at a foundational level, working directly with customers and providers while architecting the infrastructure backbone for reliable, scalable AI compute. You’ll influence technical direction and help define what world-class AI infrastructure operations look like.

Andromeda Cluster is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Vacancy posted 4 days ago