Machine Learning Infrastructure Engineer

$150k

Institute of Foundation Models

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting‑edge foundation model training, alongside world‑class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem‑solving skills will be instrumental in establishing MBZUAI as a global hub for high‑performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

We're looking for a distributed ML infrastructure engineer to help extend and scale our training systems. You’ll work side‑by‑side with world‑class researchers and engineers to:

  • Extend distributed training frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)
  • Implement distributed optimizers from mathematical specs
  • Build robust config + launch systems across multi‑node, multi‑GPU clusters
  • Own experiment tracking, metrics logging, and job monitoring for external visibility
  • Improve training system reliability, maintainability, and performance

While much of the work will support large‑scale pre‑training, pre‑training experience is not required. Strong infrastructure and systems experience is what we value most.

Key Responsibilities

  • Distributed Framework Ownership – Extend or modify training frameworks (e.g., DeepSpeed, FSDP) to support new use cases and architectures.
  • Optimizer Implementation – Translate mathematical optimizer specs into distributed implementations.
  • Launch Config & Debugging – Create and debug multi‑node launch scripts with flexible batch sizes, parallelism strategies, and hardware targets.
  • Metrics & Monitoring – Build systems for experiment tracking, job monitoring, and logging usable by collaborators and researchers.
  • Infra Engineering – Write production‑quality code and tests for ML infra in PyTorch or JAX; ensure reliability and maintainability at scale.

Qualifications

Must-Haves:
  • 5+ years of experience in ML systems, infra, or distributed training
  • Experience modifying distributed ML frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)
  • Strong software engineering fundamentals (Python, systems design, testing)
  • Proven multi‑node experience (e.g., Slurm, Kubernetes, Ray) and debugging skills (e.g., NCCL/GLOO)
  • Ability to implement algorithms across GPUs/nodes based on mathematical specs
  • Experience working on an ML platform/infrastructure team and/or a distributed inference optimization team
  • Experience with large‑scale machine learning workloads (strong ML fundamentals)

Nice-to-Haves:
  • Exposure to mixed‑precision training (e.g., bf16, fp8) with accuracy validation
  • Familiarity with performance profiling, kernel fusion, or memory optimization
  • Open‑source contributions or published research (MLSys, ICML, NeurIPS)
  • CUDA or Triton kernel experience
  • Experience with large‑scale pre‑training
  • Experience building custom training pipelines at scale and modifying them for custom needs
  • Deep familiarity with training infrastructure and performance tuning

$150,000 - $450,000 a year

Benefits

  • Comprehensive medical, dental, and vision
  • 401(k) program
  • Generous PTO, sick leave, and holidays
  • Paid parental leave and family‑friendly benefits
  • On‑site amenities and perks: complimentary lunch, gym access, and a short walk to the Sunnyvale Caltrain station

Vacancy posted 1 day ago