Machine Learning Engineer - Training Systems
Rhoda AI
At Rhoda AI, we're building the full-stack foundation for the next generation of humanoid robots, from high-performance, software-defined hardware to the foundational models and video world models that control it. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a research team that includes researchers from Stanford, Berkeley, Harvard, and beyond. We're not building a feature; we're building a new computing platform for physical work, and with over $400M raised, we're investing aggressively in the R&D, hardware development, and manufacturing scale-up to make that a reality.

We're hiring a Staff/Principal ML Systems Engineer to own training performance end-to-end and turn our training platform into a high-efficiency, high-reliability engine for research iteration.

What You'll Do

Own performance at scale
- Diagnose and improve end-to-end training performance for large models trained on multimodal robotic data (vision, proprioception, actions, language, video)
- Build a repeatable workflow for performance attribution: step-time breakdown (compute vs collectives/communication vs dataloader vs checkpointing), scaling curves, and bottleneck identification at different GPU counts
- Drive measurable gains in:
  - Distributed efficiency (overlap, bucket sizing, rank/topology mapping, parallelism strategy)
  - Compute efficiency (kernel hotspots, fusion, attention performance, framework overhead)
  - Memory efficiency (activation checkpointing, packing/bucketing, reduced padding waste)

Make performance observable and durable
- Create "source of truth" metrics and dashboards for both per-job performance ("why is this run slow?") and fleet-wide performance ("where are we losing GPU-hours this week?")
- Build automated performance regression detection: microbenchmark suite per model family, CI/perf gates or lightweight canary runs, "golden configs" and standard launch templates

Partner deeply with researchers (no silos)
- Work closely with researchers and research engineers to translate model changes into scalable implementations
- Provide guidance on training strategy tradeoffs relevant to robotics world models (sequence lengths, rollout/eval cadence, variable-length multimodal data, etc.)
- Reduce the operational burden on researchers so they can focus on model quality and robotic behavior

Collaborate on cluster efficiency (as part of the infra team)
- Partner with infra/SRE to reduce wasted GPU-hours from:
  - Stragglers and degraded nodes
  - Network health issues
  - Checkpoint stalls and storage bottlenecks
  - Scheduler placement issues for large distributed jobs

What We're Looking For
- Significant experience delivering distributed training performance improvements in production research environments (large-scale GPU training strongly preferred)
- Strong hands-on experience with modern training stacks (e.g., PyTorch; familiarity with JAX a plus)
- Deep understanding of distributed training concepts and tradeoffs: sharded training (FSDP/ZeRO-style), tensor/pipeline parallelism, gradient accumulation, comm/compute overlap, and diagnosing and improving collective communication performance
- Strong debugging and measurement instincts: you can turn ambiguous "it's slow" into a clear bottleneck + experiment plan + validated fix
- Comfortable operating in a fast-moving startup environment with high ownership and minimal bureaucracy

Nice to Have (But Not Required)
- Experience with GPU kernel-level performance work (CUDA/Triton), fused ops, compiler/graph capture
- Experience with multimodal/video training and variable-length sequence packing/bucketing
- Experience building observability systems for ML training (metrics/logs/traces + dashboards + alerting)
- Familiarity with large-cluster scheduling or topology-aware placement (Slurm/K8s/HPC environments)

Why This Role
- Direct impact on model iteration speed: your work translates directly into faster research cycles and better robotic capability
- Work at the frontier of large-scale training for real-world robotics, not toy benchmarks
- Tight collaboration between systems, research, and infrastructure (no silos)
- High ownership in a small, ambitious team building foundational technology
- Meaningful leverage: improvements you make compound across every training run the research team executes

