ML Infra Engineer

Physical Intelligence

ML Infrastructure Engineer

In this role you will help scale and optimize our training systems and core model code. You'll own critical infrastructure for large-scale training, from managing GPU/TPU compute and job orchestration to building reusable and efficient JAX training pipelines. You'll work closely with researchers and model engineers to translate ideas into experiments—and those experiments into production training runs.

This is a hands-on, high-leverage role at the intersection of ML, software engineering, and scalable infrastructure.

The Team

The ML Infrastructure team supports and accelerates PI's core modeling efforts by building the systems that make large-scale training reliable, reproducible, and fast. The team works closely with research, data, and platform engineers to ensure models can scale from prototype to production-grade training runs.

In This Role You Will
  • Own training/inference infrastructure: Design, implement, and maintain systems for large-scale model training, including scheduling, job management, checkpointing, and metrics/logging.
  • Scale distributed training: Work with researchers to scale JAX-based training across TPU and GPU clusters with minimal friction.
  • Optimize performance: Profile and improve memory usage, device utilization, throughput, and distributed synchronization.
  • Enable rapid iteration: Build abstractions for launching, monitoring, debugging, and reproducing experiments.
  • Manage compute resources: Ensure efficient allocation and utilization of cloud-based GPU/TPU compute while controlling cost.
  • Partner with researchers: Translate research needs into infra capabilities and guide best practices for training at scale.
  • Contribute to core training code: Evolve JAX model and training code to support new architectures, modalities, and evaluation metrics.

What We Hope You'll Bring

- Strong software engineering fundamentals and experience building ML training infrastructure or internal platforms.

- Hands-on large-scale training experience in JAX (preferred) or PyTorch.

- Familiarity with distributed training, multi-host setups, data loaders, and evaluation pipelines.

- Experience managing training workloads on cluster schedulers and cloud platforms (e.g., SLURM, Kubernetes, GCP TPU/GKE, AWS).

- Ability to debug and optimize performance bottlenecks across the training stack.

- Strong cross-functional communication and ownership mindset.

Bonus Points If You Have

- Deep ML systems background (e.g., training compilers, runtime optimization, custom kernels).

- Experience operating close to hardware (GPU/TPU performance tuning).

- Background in robotics, multimodal models, or large-scale foundation models.

- Experience designing abstractions that balance researcher flexibility with system reliability.

Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

Vacancy posted 3 days ago