ML Infra Engineer (TPU/Jax/Optimization)

Physical Intelligence

In this role you will help scale and optimize our training systems and core model code. You’ll own critical infrastructure for large-scale training, from managing GPU/TPU compute and job orchestration to building reusable and efficient JAX training pipelines. You’ll work closely with researchers and model engineers to translate ideas into experiments, and those experiments into production training runs. This is a hands‑on, high‑leverage role at the intersection of ML, software engineering, and scalable infrastructure.

The Team

The ML Infrastructure team supports and accelerates PI’s core modeling efforts by building the systems that make large‑scale training reliable, reproducible, and fast. The team works closely with research, data, and platform engineers to ensure models can scale from prototype to production‑grade training runs.

In This Role You Will

  • Own training/inference infrastructure: Design, implement, and maintain systems for large‑scale model training, including scheduling, job management, checkpointing, and metrics/logging.
  • Scale distributed training: Work with researchers to scale JAX‑based training across TPU and GPU clusters with minimal friction.
  • Optimize performance: Profile and improve memory usage, device utilization, throughput, and distributed synchronization.
  • Enable rapid iteration: Build abstractions for launching, monitoring, debugging, and reproducing experiments.
  • Manage compute resources: Ensure efficient allocation and utilization of cloud‑based GPU/TPU compute while controlling cost.
  • Partner with researchers: Translate research needs into infra capabilities and guide best practices for training at scale.
  • Contribute to core training code: Evolve JAX model and training code to support new architectures, modalities, and evaluation metrics.

What We Hope You’ll Bring

  • Strong software engineering fundamentals and experience building ML training infrastructure or internal platforms.
  • Hands‑on large‑scale training experience in JAX (preferred), PyTorch.
  • Familiarity with distributed training, multi‑host setups, data loaders, and evaluation pipelines.
  • Experience managing training workloads on cloud platforms (e.g., SLURM, Kubernetes, GCP TPU/GKE, AWS).
  • Ability to debug and optimize performance bottlenecks across the training stack.
  • Strong cross‑functional communication and ownership mindset.

Bonus Points If You Have

  • Deep ML systems background (e.g., training compilers, runtime optimization, custom kernels).
  • Experience operating close to hardware (GPU/TPU performance tuning).
  • Background in robotics, multimodal models, or large‑scale foundation models.
  • Experience designing abstractions that balance researcher flexibility with system reliability.

Vacancy posted 10 hours ago
Similar jobs that could be interesting for you
Based on the ML Infra Engineer (TPU/Jax/Optimization) in San Francisco, CA vacancy
  •  ...San Francisco is seeking a skilled ML Infrastructure Engineer to manage and optimize large-scale training systems. In this...  ...training, ensuring efficient GPU/TPU utilization while working closely with...  ...skills and experience with JAX, distributed training, and cloud platforms... 
    Suggested

    Physical Intelligence

    San Francisco, CA
    3 days ago
  •  ...heterogeneous fleet of GPU and TPU clusters — spanning...  ...The Team The ML Infrastructure team...  ...closely with ML Infra (training systems),...  ...re-scheduling. - Optimize Accelerator...  ...- Strong software engineering fundamentals - Experience...  ...- Familiarity with JAX, PyTorch, or... 
    Suggested
    Flexible hours

    Physical Intelligence

    San Francisco, CA
    9 hours ago
  •  ..., a fast-growing AI company in San Francisco, is hiring a Machine Learning Infra Engineer. This role involves building and maintaining the training and inference frameworks necessary for optimal performance. Ideal candidates should possess strong Python skills, have a background... 
    Suggested

    Reducto

    San Francisco, CA
    9 hours ago
  •  ...company in San Francisco is seeking an engineering professional to develop and manage intelligent...  ...resource allocation across GPU and TPU clusters while enhancing overall system...  ...collaborate closely with researchers to optimize the performance of AI workloads. Competitive... 
    Suggested

    Physical Intelligence

    San Francisco, CA
    10 hours ago
  • $141k - $249k

     .../deployment techniques. - Work with researchers and ML engineers on best-practices for optimal resource usage. - Create and improve tooling and dashboards...  ...in deep learning frameworks such as PyTorch or Jax. - Skilled in profiling CPU and GPU code using tools... 
    Suggested
    Work at office
    Work from home
    Flexible hours

    Waabi

    San Francisco, CA
    2 days ago
  •  ...in San Francisco seeks a Machine Learning Engineer to build and maintain infrastructure for...  ...systems, work closely with researchers, and optimize training processes. Candidates should...  ...software engineering skills and experience with JAX or PyTorch. Join a dynamic team at the... 

    Monograph

    San Francisco, CA
    1 day ago
  •  ...cutting-edge technology company in San Francisco is seeking an ML Infrastructure Engineer to build and scale machine learning systems for real-time...  ...scalable training pipelines for computer vision models, optimizing them for edge devices, and collaborating closely with... 

    Specter Services LLC

    San Francisco, CA
    10 hours ago
  • $100k - $200k

     ...Voiceflow is seeking a skilled ML-Infrastructure Engineer in San Francisco to architect and operate auto-scaling systems for our voice AI simulation platform. The role includes optimizing GPU and compute infrastructure, ensuring high performance and reliability. Ideal... 
    Work at office

    Voiceflow

    San Francisco, CA
    10 hours ago
  •  ...Seattle is looking for a Senior Machine Learning Engineer who will design and implement AI-driven solutions for optimizing their infrastructure. This role requires strong...  ...research. Ideal candidates will bring over 8 years of ML experience, proficiency in PyTorch or TensorFlow... 

    DocuSign

    San Francisco, CA
    10 hours ago
  •  ...Fairygodboss is looking for a Machine Learning Engineer in San Francisco, California, to lead the development of cutting-edge ML systems that enhance customer experience...  ...systems, particularly in personalization, optimization, or causal inference. A comprehensive benefits... 

    Fairygodboss

    San Francisco, CA
    10 hours ago
  • $181.1k - $318.4k

     ...commitment to environmental sustainability and optimal resource utilization. This team plays a...  ...at scale. This team also focuses on ML-driven forecasting, capacity planning, resource...  ...scale services. As a Sr. ML Optimization Engineer, you will work at the intersection of... 
    Relocation

    Apple

    San Francisco, CA
    1 day ago
  •  ...ML Systems Engineer – Robotics & AI We are building the full-stack foundation for the next generation of humanoid robots, from high-performance...  ...modern training stacks such as PyTorch, with familiarity in JAX a plus. Deep understanding of distributed training concepts and... 

    Maxwell Bond

    San Francisco, CA
    1 day ago
  •  ...Zensors is seeking a Machine Learning Engineer focused on ML Runtime & Optimization to enhance our visual sensing platform. The role involves optimizing machine learning pipelines and collaborating with AI research teams to implement high-performance algorithms. Ideal... 

    Zensors

    San Francisco, CA
    9 hours ago
  •  ...Francisco is looking for a Senior Software Engineer to build scalable infrastructure for...  ...design distributed training systems and optimize GPU utilization while collaborating with...  ...candidates have over 5 years of experience in ML infrastructure and a strong background in... 

    BaseTen

    San Francisco, CA
    10 hours ago
  • $160k - $240k

    Tensec is looking for a Machine Learning Engineer in San Francisco to build algorithms and optimization systems that drive their autonomous decision engine. This role involves designing trading strategies, building execution layers, and deploying robust models that enhance... 
    Relocation package

    Tensec

    San Francisco, CA
    4 days ago
  • Responsibilities Design, deploy, and maintain large distributed ML training and inference clusters Develop efficient, scalable end...  ...scales Analyze, profile and debug low-level GPU operations to optimize performance Stay up-to-date on research to bring new ideas to work... 

    Kindredventures

    San Francisco, CA
    3 days ago
  •  ...scale GPU infrastructure. This role requires expertise in deploying GPU systems for high-throughput inference and model performance optimization. The ideal candidate will have hands-on experience with modern inference frameworks and a solid understanding of reinforcement... 

    Reflection AI

    San Francisco, CA
    1 day ago
  •  ...Role We're looking for someone who loves optimizing model inference to join us in building the...  ...complex and bleeding-edge part of our engine. You'll be working on making AI models run...  ...just works You think the current state of ML deployment could be way better What you'll... 

    Comfy

    San Francisco, CA
    9 hours ago
  •  ...fast-growing robotics company in San Francisco seeks an applied ML engineer to design and scale the ML systems for their data platform....  ...-on role focuses on deploying production infrastructure, optimizing inference pipelines, and working on retrieval applications over... 
    Remote work

    Foxglove

    San Francisco, CA
    5 days ago
  •  ...hands-on role focused on scaling and optimizing ML training systems. Key...  ...improving performance, and managing GPU/TPU compute resources. Ideal candidates will have strong software engineering foundations, hands-on experience in JAX and PyTorch, and familiarity with... 

    Physical Intelligence

    San Francisco, CA
    4 days ago
  • Ensure that ML models can be effectively developed, deployed, managed, and monitored in Production environments. Productionize ML...  ...workflow to improve efficiency and reproducibility Performance optimization - identify ways to optimize the performance, efficiency, and scalability... 
    Permanent employment
    Contract work
    Local area

    Cloud Hybrid Technologies, LLC

    San Francisco, CA
    4 days ago
  •  ...looking for a Member of Technical Staff focused on building and optimizing ML inference systems in San Francisco. The role involves...  ...real-world workloads. Candidates should have strong software engineering skills, experience with ML inference systems, and proficiency... 

    Acceler8 Talent

    San Francisco, CA
    5 days ago
  •  ...edge AI research and development. The role involves building and scaling training and inference infrastructure, designing ML kernels, and optimizing performance. Ideal candidates should have a passion for addressing ambitious challenges at the intersection of AI and... 

    Mirendil

    San Francisco, CA
    1 day ago
  • $147.6k - $274k

     ...Job Description: Machine Learning Engineer - Infra San Francisco, CA The Opportunity We are revolutionizing drug discovery...  ...learning techniques. We are seeking a highly motivated and skilled ML Engineer to join our growing team within Genentech Research... 
    Relocation package

    ESR Healthcare

    San Francisco, CA
    1 day ago
  •  ...Benchmark, and First Round Capital, and are hiring a Machine Learning Engineer to help us train and deploy the models critical to the performance of our core product. The Opportunity As an ML Infra Engineer , you’ll play a key role in building the inference and training... 
    Work at office
    Local area

    Reducto

    San Francisco, CA
    1 day ago
  •  ...Mach9 ML Engineer Role At Mach9, ML Engineers build the perception models at the core of...  ...production-quality ML library like PyTorch, JAX, or TensorFlow. Bonus...  ...delivering production-grade models with optimization techniques such as quantization, pruning... 

    Mach9

    San Francisco, CA
    2 days ago
  • $250k - $400k

     ...in isolation. It's building the engine that research runs on. You'...  ...Experience building and scaling ML systems in production Strong...  ...in frameworks like PyTorch, JAX, or similar Strong engineering...  ...Roles available: ML Engineer, ML Infra, Research Engineers & Research... 
    Remote work

    techire ai

    San Francisco, CA
    1 day ago
  •  ...URun in San Francisco is searching for an ML Infrastructure and Platform Engineer. In this role, you will lead the architecture and scaling of our GPU compute platform from the ground up, ensuring high availability and low-latency inference. This is a founding technical... 

    U-Run

    San Francisco, CA
    9 hours ago
  •  ...veterans and innovative thinkers. We don't believe culture can be engineered - but when it falls into place, it's a once-in-a-lifetime...  ...never felt so present. Position Overview We're looking for an ML infrastructure engineer to help design, build, and scale the foundational... 
    Local area

    Humble Robotics

    San Francisco, CA
    1 day ago
  •  ...neurotechnology is seeking a Senior Machine Learning Infrastructure Engineer to design and scale critical infrastructure powering ML applications. This role involves creating robust data pipelines and optimizing modeling processes, essential for developing innovative... 

    Echo Neurotechnologies

    San Francisco, CA
    10 hours ago
