ML Infra Engineer (TPU/JAX/Optimization)

Physical Intelligence

In this role you will help scale and optimize our training systems and core model code. You’ll own critical infrastructure for large-scale training, from managing GPU/TPU compute and job orchestration to building reusable and efficient JAX training pipelines. You’ll work closely with researchers and model engineers to translate ideas into experiments, and those experiments into production training runs. This is a hands-on, high-leverage role at the intersection of ML, software engineering, and scalable infrastructure.

The Team

The ML Infrastructure team supports and accelerates PI’s core modeling efforts by building the systems that make large-scale training reliable, reproducible, and fast. The team works closely with research, data, and platform engineers to ensure models can scale from prototype to production-grade training runs.

In This Role You Will

Own training/inference infrastructure: Design, implement, and maintain systems for large-scale model training, including scheduling, job management, checkpointing, and metrics/logging.
Scale distributed training: Work with researchers to scale JAX-based training across TPU and GPU clusters with minimal friction.
Optimize performance: Profile and improve memory usage, device utilization, throughput, and distributed synchronization.
Enable rapid iteration: Build abstractions for launching, monitoring, debugging, and reproducing experiments.
Manage compute resources: Ensure efficient allocation and utilization of cloud-based GPU/TPU compute while controlling cost.
Partner with researchers: Translate research needs into infra capabilities and guide best practices for training at scale.
Contribute to core training code: Evolve JAX model and training code to support new architectures, modalities, and evaluation metrics.

What We Hope You’ll Bring

Strong software engineering fundamentals and experience building ML training infrastructure or internal platforms.
Hands-on large-scale training experience in JAX (preferred) or PyTorch.
Familiarity with distributed training, multi-host setups, data loaders, and evaluation pipelines.
Experience managing training workloads on cluster schedulers and cloud platforms (e.g., SLURM, Kubernetes, GCP TPU/GKE, AWS).
Ability to debug and optimize performance bottlenecks across the training stack.
Strong cross-functional communication and an ownership mindset.

Bonus Points If You Have

Deep ML systems background (e.g., training compilers, runtime optimization, custom kernels).
Experience operating close to hardware (GPU/TPU performance tuning).
Background in robotics, multimodal models, or large-scale foundation models.
Experience designing abstractions that balance researcher flexibility with system reliability.

Vacancy posted 10 hours ago
