Machine Learning Infrastructure Engineer

$150k

Institute of Foundation Models

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting‑edge foundation model training, alongside world‑class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem‑solving skills will be instrumental in establishing MBZUAI as a global hub for high‑performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

We're looking for a distributed ML infrastructure engineer to help extend and scale our training systems. You’ll work side‑by‑side with world‑class researchers and engineers to:

  • Extend distributed training frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)
  • Implement distributed optimizers from mathematical specs
  • Build robust config + launch systems across multi‑node, multi‑GPU clusters
  • Own experiment tracking, metrics logging, and job monitoring for external visibility
  • Improve training system reliability, maintainability, and performance

While much of the work will support large‑scale pre‑training, pre‑training experience is not required. Strong infrastructure and systems experience is what we value most.

Key Responsibilities

  • Distributed Framework Ownership – Extend or modify training frameworks (e.g., DeepSpeed, FSDP) to support new use cases and architectures.
  • Optimizer Implementation – Translate mathematical optimizer specs into distributed implementations.
  • Launch Config & Debugging – Create and debug multi‑node launch scripts with flexible batch sizes, parallelism strategies, and hardware targets.
  • Metrics & Monitoring – Build systems for experiment tracking, job monitoring, and logging usable by collaborators and researchers.
  • Infra Engineering – Write production‑quality code and tests for ML infra in PyTorch or JAX; ensure reliability and maintainability at scale.

Qualifications

Must-Haves:
  • 5+ years of experience in ML systems, infra, or distributed training
  • Experience modifying distributed ML frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)
  • Strong software engineering fundamentals (Python, systems design, testing)
  • Proven multi‑node experience (e.g., Slurm, Kubernetes, Ray) and debugging skills (e.g., NCCL/Gloo)
  • Ability to implement algorithms across GPUs/nodes based on mathematical specs
  • Experience working on an ML platform or infrastructure team, and/or a distributed inference optimization team
  • Experience with large‑scale machine learning workloads (strong ML fundamentals)

Nice-to-Haves:
  • Exposure to mixed‑precision training (e.g., bf16, fp8) with accuracy validation
  • Familiarity with performance profiling, kernel fusion, or memory optimization
  • Open‑source contributions or published research (MLSys, ICML, NeurIPS)
  • CUDA or Triton kernel experience
  • Experience with large‑scale pre‑training
  • Experience building custom training pipelines at scale and modifying them for custom needs
  • Deep familiarity with training infrastructure and performance tuning

$150,000 - $450,000 a year

Benefits

  • Comprehensive medical, dental, and vision
  • 401(k) program
  • Generous PTO, sick leave, and holidays
  • Paid parental leave and family‑friendly benefits
  • On‑site amenities and perks: complimentary lunch, gym access, and a short walk to the Sunnyvale Caltrain station

Vacancy posted 1 day ago