Staff AI Infrastructure Engineer

$241k - $331k

Biohub

Redwood City, CA (Hybrid)

Biohub is the first large-scale initiative bringing frontier AI models, massive compute, and frontier experimental capabilities under one roof. We're building a general-purpose system to accelerate scientific discovery, integrating frontier AI models, biological foundation models, and lab capabilities, with the ultimate goal of curing disease. Our technology powers scientists around the world, translating AI capabilities into tools that accelerate research everywhere.

The Team

The AI Cluster Production Engineering team is part of the AI Compute Platform organization at Biohub, a non-profit research lab committed to open science and open-source AI. We own the design, operation, and reliability of large-scale multi-GPU AI clusters that power frontier AI biology research: protein language models, genomic foundation models, and scientific reasoning systems built to be shared, not monetized. Our clusters run Slurm on Kubernetes infrastructure and support everything from day-to-day AI researcher workflows to multi-node hero training runs at thousands of GPUs. The team works at the intersection of AI tooling, distributed systems, HPC, and frontier AI, debugging deep AI infrastructure problems and building AI systems critical to the entire AI organization.

The Opportunity

CZ Biohub's mission is to cure or prevent all human disease. Achieving that requires training frontier-scale AI biology models, and that demands reliable, high-performance compute infrastructure. This is production engineering work at a frontier AI lab, with the twist that the mission is biology and the science is open. You'll keep GPU clusters running at high utilization, debug the toughest distributed systems failures, and build the operational foundations for scaling to multi-thousand GPU hero runs. The technical problems are genuinely hard (e.g., multi-node distributed training, InfiniBand fabrics, large-scale storage, Slurm at scale), and you'll solve them inside an organization where the work is aimed at helping people, not optimizing ad revenue.

What You'll Do

Own reliability, observability, and incident response for multi-site GPU clusters running Slurm on Kubernetes. Build the systems, automation, and processes that keep clusters healthy and enable fast, efficient recovery when things break.
Debug and resolve deep infrastructure failures across storage, networking, scheduling, and GPU compute layers. Build the tooling and operational patterns that make these failures easier to detect, diagnose, and prevent.
Design and execute GPU cluster scaling plans, systematically validating storage, networking, interconnect, and scheduler behavior as clusters grow to support larger training runs.
Build automation and tooling to manage cluster operations at scale: capacity planning, GPU utilization monitoring, workload manager policy management, and pod lifecycle automation.
Drive configuration-as-code practices, ensuring cluster state is reproducible, auditable, and managed through version-controlled pipelines.
Collaborate directly with AI researchers and hero run leads to understand training workload patterns and design infrastructure that meets frontier-scale requirements.
Own the vendor relationship on technical issues — escalating SEV1s, coordinating across multiple partners and network backbone teams, holding them accountable to root/proximate cause analysis and SLAs.
Contribute to capacity planning: projecting GPU demand, managing cluster expansion across GPU generations, and coordinating multi-cluster strategy.
Improve operational resilience: reduce mean time to detect and resolve incidents, reduce toil through automation, and develop runbooks that scale the team's operational knowledge beyond any individual.

What You'll Bring

8+ years of AI/ML infrastructure engineering experience, with deep expertise in at least one of: HPC/Slurm cluster operations, Kubernetes at scale, distributed systems debugging, or GPU compute infrastructure.
Strong Linux systems fundamentals — networking (TCP/IP, InfiniBand, RDMA, MTU/MSS/PMTUD), storage (NFS, VAST, WEKA, POSIX semantics), kernel internals (cgroups, namespaces, eBPF, sysctls).
Hands-on experience with Kubernetes and cloud-native infrastructure — pod lifecycle, CNI plugins (Cilium preferred), StatefulSets, Helm, ArgoCD, or equivalent GitOps tooling.
Experience with HPC workload managers — Slurm strongly preferred (QoS, partitions, preemption, accounting, Sunk/CoreWeave patterns a plus).
Debugging instinct: ability to form hypotheses quickly, design controlled experiments, and root cause complex multi-system failures under pressure. You enjoy finding the hard bugs.
Proficiency in Python and Bash for automation and tooling. Go, Rust, or C/C++ a plus.
Experience with observability stacks — Prometheus/VictoriaMetrics, Grafana, DCGM metrics, distributed tracing. You know how to instrument systems you don't control.
Excellent communication — you can write a crisp incident summary for researchers, a technical escalation to a vendor CTO, and a system design doc for teammates, all in the same day.
Bonus: experience with distributed AI training infrastructure (NCCL, PyTorch DDP, multi-node job debugging, checkpoint/restart patterns, container environments for large-scale training).

Compensation

The Redwood City, CA base pay range for a new hire in this role is $241,000 - $331,000. New hires are typically hired into the lower portion of the range, enabling employee growth in the range over time. Actual placement in range is based on job-related skills and experience, as evaluated throughout the interview process.

Better Together

As we grow, we're excited to strengthen in-person connections and cultivate a collaborative, team-oriented environment. This role is a hybrid position requiring you to be onsite for at least 60% of the working month, approximately 3 days a week, with specific in-office days determined by the team's manager. The exact schedule will be at the hiring manager's discretion and communicated during the interview process.

Benefits for the Whole You

We're thankful to have an incredible team behind our work. To honor their commitment, we offer a wide range of benefits to support the people who make all we do possible.

A generous employer match on employee 401(k) contributions to support planning for the future.
Paid time off to volunteer at an organization of your choice.
Funding for select family-forming benefits.
Relocation support for employees who need assistance moving.

If you're interested in a role but your previous experience doesn't perfectly align with each qualification in the job description, we still encourage you to apply as you may be the perfect fit for this or another role.

Vacancy posted 3 days ago
