
Machine Learning Engineer - Training Systems

Rhoda AI

At Rhoda AI, we're building the full-stack foundation for the next generation of humanoid robots, from high-performance, software-defined hardware to the foundational models and video world models that control it. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a research team that includes researchers from Stanford, Berkeley, Harvard, and beyond. We're not building a feature; we're building a new computing platform for physical work, and with over $400M raised, we're investing aggressively in the R&D, hardware development, and manufacturing scale-up to make that a reality.

We're hiring a Staff/Principal ML Systems Engineer to own training performance end-to-end and turn our training platform into a high-efficiency, high-reliability engine for research iteration.

What You'll Do

Own performance at scale
  • Diagnose and improve end-to-end training performance for large models trained on multimodal robotic data (vision, proprioception, actions, language, video)
  • Build a repeatable workflow for performance attribution: step-time breakdown (compute vs collectives/communication vs dataloader vs checkpointing), scaling curves, and bottleneck identification at different GPU counts
  • Drive measurable gains in:
    • Distributed efficiency (overlap, bucket sizing, rank/topology mapping, parallelism strategy)
    • Compute efficiency (kernel hotspots, fusion, attention performance, framework overhead)
    • Memory efficiency (activation checkpointing, packing/bucketing, reduced padding waste)

Make performance observable and durable
  • Create "source of truth" metrics and dashboards for both:
    • Per-job performance ("why is this run slow?")
    • Fleet-wide performance ("where are we losing GPU-hours this week?")
  • Build automated performance regression detection:
    • Microbenchmark suite per model family
    • CI/perf gates or lightweight canary runs
    • "Golden configs" and standard launch templates

Partner deeply with researchers (no silos)
  • Work closely with researchers and research engineers to translate model changes into scalable implementations
  • Provide guidance on training strategy tradeoffs relevant to robotics world models (sequence lengths, rollout/eval cadence, variable-length multimodal data, etc.)
  • Reduce the operational burden on researchers so they can focus on model quality and robotic behavior

Collaborate on cluster efficiency (as part of the infra team)
  • Partner with infra/SRE to reduce wasted GPU-hours from:
    • Stragglers and degraded nodes
    • Network health issues
    • Checkpoint stalls and storage bottlenecks
    • Scheduler placement issues for large distributed jobs

What We're Looking For
  • Significant experience delivering distributed training performance improvements in production research environments (large-scale GPU training strongly preferred)
  • Strong hands-on experience with modern training stacks (e.g., PyTorch; familiarity with JAX a plus)
  • Deep understanding of distributed training concepts and tradeoffs: sharded training (FSDP/ZeRO-style), tensor/pipeline parallelism, gradient accumulation, comm/compute overlap, and diagnosing and improving collective communication performance
  • Strong debugging and measurement instincts: you can turn ambiguous "it's slow" into a clear bottleneck + experiment plan + validated fix
  • Comfortable operating in a fast-moving startup environment with high ownership and minimal bureaucracy

Nice to Have (But Not Required)
  • Experience with GPU kernel-level performance work (CUDA/Triton), fused ops, compiler/graph capture
  • Experience with multimodal/video training and variable-length sequence packing/bucketing
  • Experience building observability systems for ML training (metrics/logs/traces + dashboards + alerting)
  • Familiarity with large-cluster scheduling or topology-aware placement (Slurm/K8s/HPC environments)

Why This Role
  • Direct impact on model iteration speed: your work translates directly into faster research cycles and better robotic capability
  • Work at the frontier of large-scale training for real-world robotics, not toy benchmarks
  • Tight collaboration between systems, research, and infrastructure (no silos)
  • High ownership in a small, ambitious team building foundational technology
  • Meaningful leverage: improvements you make compound across every training run the research team executes

Vacancy posted 3 days ago
Similar jobs that could be interesting for you
Based on the Machine Learning Engineer - Training Systems in Palo Alto, CA vacancy
  • $213k - $263k

     ...Senior Machine Learning Engineer, Prediction & Planning, System Architecture Waymo is an autonomous driving technology company with the mission to be the world...  ...exact work location, experience, relevant training and education, and skill level. Your recruiter can... 
    Training
    Full time
    Contract work
    Internship
    Remote work

    Waymo

    Mountain View, CA
    4 days ago
  • $200k - $340k

     ...assembling a diverse, world-class team-engineers, designers, researchers, and...  ...full model lifecycle from data and training through evaluation, deployment, and...  ...in AI modeling, applied Machine Learning, or large scale ML systems, with demonstrated ownership of technical... 
    Training
    Full time
    Temporary work
    Local area
    Flexible hours

    HP IQ

    Palo Alto, CA
    1 day ago
  •  ...all of their business systems through natural language...  ...workflows, and continuously learn and adapt. Moveworks...  ...Moveworks' Reasoning Engine and natural language...  ...We are looking for a Machine Learning Engineer to help...  ...including distributed training and inference pipeline... 
    Training
    Work at office
    Remote work
    Flexible hours

    ServiceNow

    Mountain View, CA
    10 hours ago
  •  ...all of their business systems through natural language...  ...workflows, and continuously learn and adapt. Moveworks...  ...Moveworks' Reasoning Engine and natural language...  ...software engineer with machine learning expertise to join...  ...datasets for model training and evaluation. You... 
    Training
    Work at office
    Immediate start
    Remote work
    Flexible hours

    ServiceNow

    Mountain View, CA
    1 day ago
  • $170k - $216k

     ...Perception team builds the system which learns the spatial-temporal representation...  ...set of sensors, enabling engineers like you to (1) develop...  ...) develop models and model training at scale, to (3) analyze...  ...and recipes for human and machine labeling of foundation scale... 
    Training
    Full time
    Remote work

    Waymo

    Mountain View, CA
    10 hours ago
  • $300k - $400k

     ...scientifically possible. About the Role You will own the systems layer that makes our frontier model training and inference fast, efficient, and tightly coupled...  ...a team of the world's best — the scientists, engineers, and problem-solvers who don't just follow the... 
    Training
    Visa sponsorship
    Flexible hours
    Shift work

    Periodic Labs

    Menlo Park, CA
    2 days ago
  • $126.8k - $220.9k

     ...Machine Learning Systems Engineer, Siri Runtime Systems and Interaction The Siri Team at Apple is actively looking for a highly motivated Systems...  ...building ML infrastructure, evaluation pipelines, or training systems Familiarity with ML frameworks (Core ML, PyTorch... 
    Training
    Relocation

    Apple

    Cupertino, CA
    4 days ago
  •  ...Summary The Siri organization is looking for passionate Machine Learning Systems Engineers to join us in developing and shipping state-of-the-art...  ...that AI brings. The organization is responsible for training on-device & cloud models, evaluating various approaches,... 
    Training

    Apple

    Cupertino, CA
    4 days ago
  • A leading robotics company in Palo Alto seeks a Staff/Principal ML Systems Engineer to enhance training performance for their innovative humanoid robots. You will optimize distributed training systems and engage closely with researchers to transform model changes into... 
    Training

    Rhoda AI

    Palo Alto, CA
    2 days ago
  •  ...Machine Learning Engineer Location: Warren, MI / Mountain View, CA Duration: Fulltime Job...  ...• Work with AWS teams to optimize training pipelines and model integration workflows...  ...plus • Prior work in multi-modal AI systems, VLMs, or robotics perception is highly... 
    Training
    Full time

    JConnect Infotech

    Mountain View, CA
    1 day ago
  •  ...Role Overview: As a Machine Learning Engineer, you will play a central role in translating cutting...  ..., architect robust model-centric systems, and ensure their seamless integration...  ...including synthetic data pipelines, model training, debugging, and performance... 
    Training

    Nace AI

    Palo Alto, CA
    2 days ago
  •  ...Evaluation Engineer Evaluation is the bottleneck in healthcare AI — you can't ship what you can't measure. You'll build the systems that tell us whether our models are safe, accurate, and...  ...by collaborating with the LLM post-training team Collaborate with research,... 
    Training

    Hippocratic AI

    Palo Alto, CA
    1 day ago
  • $230k - $280k

     ...Founding ML Engineer Poesis is building an AI-driven hedge fund...  ..., the first full-time machine learning hire who will turn research...  ...and cleaning data, to model training, validation, and signal generation...  ...time, you'll help scale the system into a full production platform... 
    Training
    Full time
    Relocation package

    Poesis LLC

    Menlo Park, CA
    3 days ago
  • $213k - $263k

     ...Machine Learning Engineer, Simulation Realism Waymo is an autonomous driving technology company...  ...realistic environments for testing and training the Waymo Driver. Our team is a diverse...  ...), roads, traffic control systems, and weather conditions. To increase... 
    Training
    Full time
    Remote work

    Waymo

    Mountain View, CA
    4 days ago
  • $194k - $214k

     ...highly customer-centric Senior ML Engineer who will join our cross-...  ...ownership of large-scale ML systems, all the way to surfacing the...  ...mentality. Experience with deep learning in a production setting,...  ...understanding how to manage data, training, deployment, and inference at... 
    Training

    Instrumental Inc

    Palo Alto, CA
    4 days ago
  • $213k - $263k

     ...Senior Machine Learning Engineer, Runtime and Serving Waymo is an autonomous driving technology...  ...looking for engineers with ML software & systems expertise to help build the next...  ...work location, experience, relevant training and education, and skill level. Your... 
    Training
    Full time
    Remote work

    Waymo

    Mountain View, CA
    3 days ago
  • $170k - $216k

     ...velocity. We're looking for a software engineer to join the team to build and maintain...  ...experience Experience with distributed systems principles and experience building...  ...exact work location, experience, relevant training and education, and skill level. Your recruiter... 
    Training
    Full time
    Remote work

    Waymo

    Mountain View, CA
    1 day ago
  • $204k - $259k

     ...automated algorithms. The learned metrics team is a strategic bet to use machine learning to ensure we...  ...bring ML to production systems and build what is Waymo...  ...models to deliver training and evaluation data for...  ...researchers and software engineers who are passionate about... 
    Training
    Full time
    Work experience placement
    Remote work

    Waymo

    Mountain View, CA
    3 days ago
  •  ...Position Summary Seeking an experienced Machine Learning Engineer to lead the development of prompt injection...  ...models that protect downstream agentic AI systems across phone, cloud, and XR/AR. Role will design, train, and deploy classifier and guardrail models... 
    Training

    The Fountain Group

    Mountain View, CA
    3 days ago
  • $213k - $263k

     ...Machine Learning Engineer, Runtime & Optimization Waymo is an autonomous driving technology company...  ...for engineers with ML software or ML systems expertise to help us improve compute...  ...work location, experience, relevant training and education, and skill level. Your... 
    Training
    Full time
    Remote work

    Waymo

    Mountain View, CA
    10 hours ago
  • $204k - $259k

     ...Machine Learning Engineer - Mapping Waymo is an autonomous driving technology company with the mission...  ...this role, you will: Design, train, and deploy machine learning models to...  ...working across various parts of the systems stack to deliver results. ~ B.S. in... 
    Training
    Full time
    Remote work

    Waymo

    Mountain View, CA
    1 day ago
  • $154.9k - $222.37k

     ...decisions. Role Overview: As an ML Engineer on our perception team, you will own the...  ...work will directly shape how our autonomous systems perceive and understand the world. What you will be doing Train and evaluate 3D perception models for object... 
    Training
    Flexible hours

    Aeva, Inc

    Mountain View, CA
    3 days ago
  • $204k - $259k

     ...Senior Machine Learning Engineer, Computer Vision/VLM Waymo is an autonomous driving technology company...  ...scale, serving as the foundation for training and validating the AV stack. We are an...  ...flywheel," continuously improving the system's captioning and reasoning abilities... 
    Training
    Full time
    Remote work

    Waymo

    Mountain View, CA
    4 days ago
  • $213k - $263k

     ...Senior Machine Learning Engineer, Multimodal Perception (LLM/VLM) Waymo is an autonomous driving technology...  .... You Will: Architect and train large-scale, onboard ML perception...  ...vehicles, robotics, or complex ML systems. ~ Fluency in Python or C++, with deep... 
    Training
    Full time
    Remote work

    Waymo

    Mountain View, CA
    1 day ago
  • $204k - $259k

     ...Senior Machine Learning Engineer, Perception LLM/VLM Waymo is an autonomous driving technology company...  .... The Perception team builds the system which learns the spatial-temporal...  ...data, to (2) develop models and model training at scale, to (3) analyze real-world behavior... 
    Training
    Full time
    Remote work

    Waymo

    Mountain View, CA
    10 hours ago
  • $80.15k - $99.77k

     ...data and reporting requirements. Use system reports and analyses to identify...  ...as economics, finance, statistics or engineering. KNOWLEDGE, SKILLS AND ABILITIES (...  ...and promotes safe behaviors based on training and lessons learned. Subject to and expected to comply... 
    Training
    Fixed term contract

    Stanford

    Stanford, CA
    1 day ago
  • $171k - $247k

     ...all. We are seeking a ML Engineering TL to join the Behavior...  ...art for how a self-driving system reasons about the world, interacts...  ...deploy large-scale models trained with Imitation Learning and Reinforcement Learning...  ...~ MS or PhD in Robotics, Machine Learning, Computer Science... 
    Training
    Work at office
    Local area
    3 days per week

    Aurora Innovation

    Mountain View, CA
    4 days ago
  • $196k - $221k

     ...alongside industry-veteran scientists and engineers. As a Machine Learning Engineer, you'll bring your strong...  ...order to scale and optimize our ML systems-creating and transforming innovative...  ...the design and implementation of training, fine-tuning, post-training, and... 
    Training
    Permanent employment

    Otter.ai

    Mountain View, CA
    1 day ago
  • $175k - $215k

     ...Machine Learning Engineer, Prediction & Planning Waymo is an autonomous driving technology company...  ...generation ML-powered prediction and planning system to enhance the performance and...  ...exact work location, experience, relevant training and education, and skill level. Your... 
    Training
    Full time
    Remote work

    Waymo

    Mountain View, CA
    2 days ago
  • $195k - $230k

     ...powered by advanced AI, recommendation systems, and adtech. Recognized by Fast...  ...Role We are looking for a Senior Machine Learning Engineer to help evolve our large-scale...  ...metrics. Own systems from offline training → online inference → A/B experimentation... 
    Training
    Full time
    Local area
    Work from home

    NewsBreak

    Mountain View, CA
    10 hours ago
