Machine Learning Infrastructure Engineer

$150k

Institute of Foundation Models

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you'll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

We're looking for a distributed ML infrastructure engineer to help extend and scale our training systems. You'll work side-by-side with world-class researchers and engineers to:

  • Extend distributed training frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)
  • Implement distributed optimizers from mathematical specs
  • Build robust config + launch systems across multi-node, multi-GPU clusters
  • Own experiment tracking, metrics logging, and job monitoring for external visibility
  • Improve training system reliability, maintainability, and performance

While much of the work will support large-scale pre-training, pre-training experience is not required. Strong infrastructure and systems experience is what we value most.

Key Responsibilities

  • Distributed Framework Ownership – Extend or modify training frameworks (e.g., DeepSpeed, FSDP) to support new use cases and architectures.
  • Optimizer Implementation – Translate mathematical optimizer specs into distributed implementations.
  • Launch Config & Debugging – Create and debug multi-node launch scripts with flexible batch sizes, parallelism strategies, and hardware targets.
  • Metrics & Monitoring – Build systems for experiment tracking, job monitoring, and logging usable by collaborators and researchers.
  • Infra Engineering – Write production-quality code and tests for ML infra in PyTorch or JAX; ensure reliability and maintainability at scale.

Qualifications

Must-Haves:

  • 5+ years of experience in ML systems, infra, or distributed training
  • Experience modifying distributed ML frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)
  • Strong software engineering fundamentals (Python, systems design, testing)
  • Proven multi-node experience (e.g., Slurm, Kubernetes, Ray) and debugging skills (e.g., NCCL/GLOO)
  • Ability to implement algorithms across GPUs/nodes based on mathematical specs
  • Experience working on an ML platform/infrastructure, and/or distributed inference optimization team
  • Experience with large-scale machine learning workloads (strong ML fundamentals)

Nice-to-Haves:

  • Exposure to mixed-precision training (e.g., bf16, fp8) with accuracy validation
  • Familiarity with performance profiling, kernel fusion, or memory optimization
  • Open-source contributions or published research (MLSys, ICML, NeurIPS)
  • CUDA or Triton kernel experience
  • Experience with large-scale pre-training
  • Experience building custom training pipelines at scale and modifying them for custom needs
  • Deep familiarity with training infrastructure and performance tuning

$150,000 - $450,000 a year

Benefits

  • Comprehensive medical, dental, and vision
  • 401(k) program
  • Generous PTO, sick leave, and holidays
  • Paid parental leave and family-friendly benefits
  • On-site amenities and perks: Complimentary lunch, gym access, and a short walk to the Sunnyvale Caltrain station

Vacancy posted 2 days ago
