Machine Learning Engineer - Training Systems

Rhoda AI

At Rhoda AI, we're building the full-stack foundation for the next generation of humanoid robots, from high-performance, software-defined hardware to the foundational models and video world models that control it. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a research team that includes researchers from Stanford, Berkeley, Harvard, and beyond. We're not building a feature; we're building a new computing platform for physical work, and with over $400M raised, we're investing aggressively in the R&D, hardware development, and manufacturing scale-up to make that a reality.

We're hiring a Staff/Principal ML Systems Engineer to own training performance end-to-end and turn our training platform into a high-efficiency, high-reliability engine for research iteration.

What You'll Do

Own performance at scale
- Diagnose and improve end-to-end training performance for large models trained on multimodal robotic data (vision, proprioception, actions, language, video)
- Build a repeatable workflow for performance attribution: step-time breakdown (compute vs collectives/communication vs dataloader vs checkpointing), scaling curves and bottleneck identification at different GPU counts
- Drive measurable gains in:
  - Distributed efficiency (overlap, bucket sizing, rank/topology mapping, parallelism strategy)
  - Compute efficiency (kernel hotspots, fusion, attention performance, framework overhead)
  - Memory efficiency (activation checkpointing, packing/bucketing, reduced padding waste)

Make performance observable and durable
- Create "source of truth" metrics and dashboards for both per-job performance ("why is this run slow?") and fleet-wide performance ("where are we losing GPU-hours this week?")
- Build automated performance regression detection: microbenchmark suite per model family, CI/perf gates or lightweight canary runs, "golden configs" and standard launch templates

Partner deeply with researchers (no silos)
- Work closely with researchers and research engineers to translate model changes into scalable implementations
- Provide guidance on training strategy tradeoffs relevant to robotics world models (sequence lengths, rollout/eval cadence, variable-length multimodal data, etc.)
- Reduce the operational burden on researchers so they can focus on model quality and robotic behavior

Collaborate on cluster efficiency (as part of the infra team)
- Partner with infra/SRE to reduce wasted GPU-hours from:
  - Stragglers and degraded nodes
  - Network health issues
  - Checkpoint stalls and storage bottlenecks
  - Scheduler placement issues for large distributed jobs

What We're Looking For
- Significant experience delivering distributed training performance improvements in production research environments (large-scale GPU training strongly preferred)
- Strong hands-on experience with modern training stacks (e.g., PyTorch; familiarity with JAX a plus)
- Deep understanding of distributed training concepts and tradeoffs: sharded training (FSDP/ZeRO-style), tensor/pipeline parallelism, gradient accumulation, comm/compute overlap, and diagnosing and improving collective communication performance
- Strong debugging and measurement instincts: you can turn ambiguous "it's slow" into a clear bottleneck + experiment plan + validated fix
- Comfortable operating in a fast-moving startup environment with high ownership and minimal bureaucracy

Nice to Have (But Not Required)
- Experience with GPU kernel-level performance work (CUDA/Triton), fused ops, compiler/graph capture
- Experience with multimodal/video training and variable-length sequence packing/bucketing
- Experience building observability systems for ML training (metrics/logs/traces + dashboards + alerting)
- Familiarity with large-cluster scheduling or topology-aware placement (Slurm/K8s/HPC environments)

Why This Role
- Direct impact on model iteration speed: your work translates directly into faster research cycles and better robotic capability
- Work at the frontier of large-scale training for real-world robotics, not toy benchmarks
- Tight collaboration between systems, research, and infrastructure (no silos)
- High ownership in a small, ambitious team building foundational technology
- Meaningful leverage: improvements you make compound across every training run the research team executes


Vacancy posted 1 day ago

