Staff AI Infrastructure Engineer

$241k - $331k

Biohub

Biohub is the first large-scale initiative bringing frontier AI models, massive compute, and frontier experimental capabilities under one roof. We're building a general-purpose system to accelerate scientific discovery, integrating frontier AI models, biological foundation models, and lab capabilities, with the ultimate goal of curing disease. Our technology powers scientists around the world, translating AI capabilities into tools that accelerate research everywhere.

The Team

The AI Cluster Production Engineering team is part of the AI Compute Platform organization at Biohub, a non-profit research lab committed to open science and open-source AI. We own the design, operation, and reliability of large-scale multi‑GPU AI clusters that power frontier AI biology research: protein language models, genomic foundation models, and scientific reasoning systems built to be shared, not monetized. Our clusters run Slurm on Kubernetes infrastructure and support everything from day‑to‑day AI researcher workflows to multi‑node hero training runs at thousands of GPUs. The team works at the intersection of AI tooling, distributed systems, HPC, and frontier AI, debugging deep AI infrastructure problems and building AI systems critical to the entire AI organization.

The Opportunity

CZ Biohub's mission is to cure or prevent all human disease. Achieving that requires training frontier-scale AI biology models, and that demands reliable, high-performance compute infrastructure. This is production engineering work at a frontier AI lab, with the twist that the mission is biology and the science is open. You'll keep GPU clusters running at high utilization, debug the toughest distributed systems failures, and build the operational foundations for scaling to multi‑thousand GPU hero runs.
The technical problems are genuinely hard (e.g., multi‑node distributed training, InfiniBand fabrics, large-scale storage, Slurm at scale), and they sit inside an organization where the work is aimed at helping people, not optimizing ad revenue.

What You'll Do

Own reliability, observability, and incident response for multi‑site GPU clusters running Slurm on Kubernetes. Build the systems, automation, and processes that keep clusters healthy and that enable fast, efficient recovery when things break.
Debug and resolve deep infrastructure failures across storage, networking, scheduling, and GPU compute layers. Build the tooling and operational patterns that make these failures easier to detect, diagnose, and prevent.
Design and execute GPU cluster scaling plans, systematically validating storage, networking, interconnect, and scheduler behavior as clusters grow to support larger training runs.
Build automation and tooling to manage cluster operations at scale: capacity planning, GPU utilization monitoring, workload manager policy management, and pod lifecycle automation.
Drive configuration‑as‑code practices, ensuring cluster state is reproducible, auditable, and managed through version‑controlled pipelines.
Collaborate directly with AI researchers and hero run leads to understand training workload patterns and design infrastructure that meets frontier‑scale requirements.
Own the vendor relationship on technical issues: escalating SEV1s, coordinating across multiple partners and network backbone teams, and holding them accountable to root/proximate cause analysis and SLAs.
Contribute to capacity planning: projecting GPU demand, managing cluster expansion across GPU generations, and coordinating multi‑cluster strategy.
Improve operational resilience by reducing mean time to detect and resolve incidents, reducing toil through automation, and developing runbooks that scale the team's operational knowledge beyond any individual.

What You'll Bring

8+ years of AI/ML infrastructure engineering experience, with deep expertise in at least one of: HPC/Slurm cluster operations, Kubernetes at scale, distributed systems debugging, or GPU compute infrastructure.
Strong Linux systems fundamentals: networking (TCP/IP, InfiniBand, RDMA, MTU/MSS/PMTUD), storage (NFS, VAST, WEKA, POSIX semantics), and kernel internals (cgroups, namespaces, eBPF, sysctls).
Hands‑on experience with Kubernetes and cloud‑native infrastructure: pod lifecycle, CNI plugins (Cilium preferred), StatefulSets, Helm, ArgoCD, or equivalent GitOps tooling.
Experience with HPC workload managers, with Slurm strongly preferred (QoS, partitions, preemption, accounting; Sunk/CoreWeave patterns a plus).
Debugging instinct: ability to form hypotheses quickly, design controlled experiments, and root cause complex multi‑system failures under pressure. You enjoy finding the hard bugs.
Proficiency in Python and Bash for automation and tooling. Go, Rust, or C/C++ a plus.
Experience with observability stacks: Prometheus/VictoriaMetrics, Grafana, DCGM metrics, distributed tracing. You know how to instrument systems you don't control.
Excellent communication: you can write a crisp incident summary for researchers, a technical escalation to a vendor CTO, and a system design doc for teammates, all in the same day.
Bonus: experience with distributed AI training infrastructure (NCCL, PyTorch DDP, multi‑node job debugging, checkpoint/restart patterns, container environments for large‑scale training).

Compensation

The Redwood City, CA base pay range for a new hire in this role is $241,000 - $331,000. New hires are typically hired into the lower portion of the range, enabling employee growth in the range over time. Actual placement in range is based on job‑related skills and experience, as evaluated throughout the interview process.

Better Together

As we grow, we’re excited to strengthen in‑person connections and cultivate a collaborative, team‑oriented environment. This role is a hybrid position requiring you to be onsite for at least 60% of the working month, approximately 3 days a week, with specific in‑office days determined by the team’s manager. The exact schedule will be at the hiring manager's discretion and communicated during the interview process.

Benefits for the Whole You

We’re thankful to have an incredible team behind our work. To honor their commitment, we offer a wide range of benefits to support the people who make all we do possible.
A generous employer match on employee 401(k) contributions to support planning for the future.
Paid time off to volunteer at an organization of your choice.
Funding for select family‑forming benefits.
Relocation support for employees who need assistance moving.

Vacancy posted 2 days ago
