Machine Learning Infrastructure Engineer
Institute of Foundation Models
About the Institute of Foundation Models
We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.
As part of our team, you’ll have the opportunity to work on the core of cutting‑edge foundation model training, alongside world‑class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem‑solving skills will be instrumental in establishing MBZUAI as a global hub for high‑performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.
The Role
We're looking for a distributed ML infrastructure engineer to help extend and scale our training systems. You’ll work side‑by‑side with world‑class researchers and engineers to:
- Extend distributed training frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)
- Implement distributed optimizers from mathematical specs
- Build robust config + launch systems across multi‑node, multi‑GPU clusters
- Own experiment tracking, metrics logging, and job monitoring for external visibility
- Improve training system reliability, maintainability, and performance
While much of the work will support large‑scale pre‑training, pre‑training experience is not required. Strong infrastructure and systems experience is what we value most.
Key Responsibilities
- Distributed Framework Ownership – Extend or modify training frameworks (e.g., DeepSpeed, FSDP) to support new use cases and architectures.
- Optimizer Implementation – Translate mathematical optimizer specs into distributed implementations.
- Launch Config & Debugging – Create and debug multi‑node launch scripts with flexible batch sizes, parallelism strategies, and hardware targets.
- Metrics & Monitoring – Build systems for experiment tracking, job monitoring, and logging usable by collaborators and researchers.
- Infra Engineering – Write production‑quality code and tests for ML infra in PyTorch or JAX; ensure reliability and maintainability at scale.
Qualifications
Must-Haves:
- 5+ years of experience in ML systems, infra, or distributed training
- Experience modifying distributed ML frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)
- Strong software engineering fundamentals (Python, systems design, testing)
- Proven multi‑node experience (e.g., Slurm, Kubernetes, Ray) and debugging skills (e.g., NCCL/GLOO)
- Ability to implement algorithms across GPUs/nodes based on mathematical specs
- Experience working on an ML platform, infrastructure, or distributed inference optimization team
- Experience with large‑scale machine learning workloads (strong ML fundamentals)
- Exposure to mixed‑precision training (e.g., bf16, fp8) with accuracy validation
- Familiarity with performance profiling, kernel fusion, or memory optimization
- Open‑source contributions or published research (MLSys, ICML, NeurIPS)
- CUDA or Triton kernel experience
- Experience with large‑scale pre‑training
- Experience building custom training pipelines at scale and modifying them for custom needs
- Deep familiarity with training infrastructure and performance tuning
$150,000 - $450,000 a year
Benefits
- Comprehensive medical, dental, and vision
- 401(k) program
- Generous PTO, sick leave, and holidays
- Paid parental leave and family‑friendly benefits
- On‑site amenities and perks: Complimentary lunch, gym access, and a short walk to the Sunnyvale Caltrain station



