Staff AI Infrastructure Engineer

$241k - $331k

Biohub

Biohub is the first large-scale initiative bringing frontier AI models, massive compute, and frontier experimental capabilities under one roof. We're building a general-purpose system to accelerate scientific discovery, integrating frontier AI models, biological foundation models, and lab capabilities, with the ultimate goal of curing disease. Our technology powers scientists around the world, translating AI capabilities into tools that accelerate research everywhere.

The Team

The AI Cluster Production Engineering team is part of the AI Compute Platform organization at Biohub, a non-profit research lab committed to open science and open-source AI. We own the design, operation, and reliability of large-scale multi‑GPU AI clusters that power frontier AI biology research: protein language models, genomic foundation models, and scientific reasoning systems built to be shared, not monetized. Our clusters run Slurm on Kubernetes infrastructure and support everything from day‑to‑day AI researcher workflows to multi‑node hero training runs at thousands of GPUs. The team works at the intersection of AI tooling, distributed systems, HPC, and frontier AI, debugging deep AI infrastructure problems and building AI systems critical to the entire AI organization.

The Opportunity

CZ Biohub's mission is to cure or prevent all human disease. Achieving that requires training frontier-scale AI biology models, and that demands reliable, high-performance compute infrastructure. This is production engineering work at a frontier AI lab, with the twist that the mission is biology and the science is open. You'll keep GPU clusters running at high utilization, debug the toughest distributed systems failures, and build the operational foundations for scaling to multi‑thousand GPU hero runs.
The technical problems are genuinely hard (e.g., multi‑node distributed training, InfiniBand fabrics, large-scale storage, Slurm at scale), and they sit inside an organization where the work is aimed at helping people, not optimizing ad revenue.

What You'll Do

  • Own reliability, observability, and incident response for multi‑site GPU clusters running Slurm on Kubernetes. Build the systems, automation, and processes that keep clusters healthy and that enable fast, efficient recovery when things break.
  • Debug and resolve deep infrastructure failures across storage, networking, scheduling, and GPU compute layers. Build the tooling and operational patterns that make these failures easier to detect, diagnose, and prevent.
  • Design and execute GPU cluster scaling plans, systematically validating storage, networking, interconnect, and scheduler behavior as clusters grow to support larger training runs.
  • Build automation and tooling to manage cluster operations at scale: capacity planning, GPU utilization monitoring, workload manager policy management, and pod lifecycle automation.
  • Drive configuration‑as‑code practices, ensuring cluster state is reproducible, auditable, and managed through version‑controlled pipelines.
  • Collaborate directly with AI researchers and hero run leads to understand training workload patterns and design infrastructure that meets frontier‑scale requirements.
  • Own the vendor relationship on technical issues: escalating SEV1s, coordinating across multiple partners and network backbone teams, and holding them accountable to root/proximate cause analysis and SLAs.
  • Contribute to capacity planning: projecting GPU demand, managing cluster expansion across GPU generations, and coordinating multi‑cluster strategy.
  • Improve operational resilience: reducing mean time to detect and resolve incidents, reducing toil through automation, and developing runbooks that scale the team's operational knowledge beyond any individual.

What You'll Bring

  • 8+ years of AI/ML infrastructure engineering experience, with deep expertise in at least one of: HPC/Slurm cluster operations, Kubernetes at scale, distributed systems debugging, or GPU compute infrastructure.
  • Strong Linux systems fundamentals: networking (TCP/IP, InfiniBand, RDMA, MTU/MSS/PMTUD), storage (NFS, VAST, WEKA, POSIX semantics), kernel internals (cgroups, namespaces, eBPF, sysctls).
  • Hands‑on experience with Kubernetes and cloud‑native infrastructure: pod lifecycle, CNI plugins (Cilium preferred), StatefulSets, Helm, ArgoCD, or equivalent GitOps tooling.
  • Experience with HPC workload managers: Slurm strongly preferred (QoS, partitions, preemption, accounting, Sunk/CoreWeave patterns a plus).
  • Debugging instinct: ability to form hypotheses quickly, design controlled experiments, and root cause complex multi‑system failures under pressure. You enjoy finding the hard bugs.
  • Proficiency in Python and Bash for automation and tooling. Go, Rust, or C/C++ a plus.
  • Experience with observability stacks: Prometheus/VictoriaMetrics, Grafana, DCGM metrics, distributed tracing. You know how to instrument systems you don't control.
  • Excellent communication: you can write a crisp incident summary for researchers, a technical escalation to a vendor CTO, and a system design doc for teammates, all in the same day.
  • Bonus: experience with distributed AI training infrastructure (NCCL, PyTorch DDP, multi‑node job debugging, checkpoint/restart patterns, container environments for large‑scale training).

Compensation

The Redwood City, CA base pay range for a new hire in this role is $241,000 - $331,000. New hires are typically hired into the lower portion of the range, enabling employee growth in the range over time. Actual placement in range is based on job‑related skills and experience, as evaluated throughout the interview process.

Better Together

As we grow, we’re excited to strengthen in‑person connections and cultivate a collaborative, team‑oriented environment. This role is a hybrid position requiring you to be onsite for at least 60% of the working month, approximately 3 days a week, with specific in‑office days determined by the team’s manager. The exact schedule will be at the hiring manager's discretion and communicated during the interview process.

Benefits for the Whole You

We’re thankful to have an incredible team behind our work. To honor their commitment, we offer a wide range of benefits to support the people who make all we do possible.

  • A generous employer match on employee 401(k) contributions to support planning for the future.
  • Paid time off to volunteer at an organization of your choice.
  • Funding for select family‑forming benefits.
  • Relocation support for employees who need assistance moving.

Vacancy posted 3 days ago