ML Infra Engineer (Supercomputing)

Physical Intelligence

Physical Intelligence builds general-purpose AI for the physical world. Training our models requires orchestrating thousands of accelerators across a heterogeneous fleet of GPU and TPU clusters spanning different hardware generations, cloud providers, and cluster topologies. Today, researchers often need to know which cluster to target, what resources are available, and how to configure their jobs accordingly. That doesn't scale. We need a scheduling and compute layer that makes the right placement decision automatically, routing jobs to the best cluster based on availability, hardware fit, cost, and priority, so researchers can focus entirely on the science.

This role owns that problem end-to-end: the scheduling systems, the placement logic, the cluster management layer, and the operational tooling that keeps it all running.

This is not cloud DevOps. It's not about standing up clusters and walking away. It's a systems role for people who care about intelligent resource allocation, utilization, fault tolerance, and making large-scale distributed training seamless.

The Team

The ML Infrastructure team supports and accelerates PI’s core modeling efforts by building the systems that make large-scale training reliable, reproducible, and fast. You will work closely with ML Infra (training systems), data platform, and research teams to ensure compute scheduling is never the bottleneck.

In This Role You Will

Own Intelligent Job Scheduling and Placement: Design and build multi-tenant scheduling systems that automatically place training jobs on the best available cluster based on hardware requirements, topology, availability, cost, and priority. Support fair resource sharing across teams and projects with quota management, priority tiers, and preemption policies. Abstract away cluster differences so researchers submit jobs without needing to know where they will land.

Scale Multi-cluster Orchestration: Build the control plane that manages the job lifecycle across diverse clusters (mixed GPU/TPU, multi-generation hardware, on-prem/cloud) and enables seamless job migration, failover, and re-scheduling.

Optimize Accelerator Utilization and Efficiency: Monitor and optimize GPU/TPU utilization across the entire fleet. Implement priority, preemption, queueing, and fairness policies that balance research velocity with cost efficiency.

Ensure Scaling and Stability: Implement fault detection, automatic recovery, and resilience for long-running multi-node training jobs. Manage health checking, node management, and scaling to thousands of accelerators.

Support Inference and Robot Deployment: Extend scheduling and orchestration to inference workloads, including deploying models to edge devices on physical robots.

Enhance Observability and Developer Experience: Build the dashboards, alerting, SLOs, and debugging tools necessary for researchers to understand job status and for the team to ensure high scheduling quality and cluster reliability.

What We Hope You’ll Bring

Strong software engineering fundamentals
Experience building or operating job scheduling / resource management systems at scale
Experience with large-scale compute clusters (GPU and/or TPU)
Familiarity with schedulers and orchestration systems (SLURM, Kubernetes, GKE, K3S, or internal equivalents)
Comfort reasoning about resource allocation, bin-packing, priority scheduling, and multi-tenancy
Understanding of how ML training workloads behave: long-running, multi-node, sensitive to stragglers, topology-dependent
A bias toward owning systems end-to-end, from design to operation
Enjoy working closely with researchers and unblocking fast-moving projects

Bonus Points If You Have

Experience building multi-cluster or federated scheduling systems
Experience with TPU infrastructure (GCP TPU slices, Multislice, GKE)
Background in cluster resource managers (Borg, YARN, Mesos, or custom schedulers)
Linux systems engineering, networking, and infrastructure-as-code
NCCL/collective communication and topology-aware placement
Experience with capacity planning and cloud cost optimization at scale
Familiarity with JAX, PyTorch, or similar ML frameworks at the runtime/systems level

Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.


Vacancy posted 18 hours ago

