Machine Learning Infrastructure Engineer

$150k

Institute of Foundation Models

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting‑edge foundation model training, alongside world‑class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem‑solving skills will be instrumental in establishing MBZUAI as a global hub for high‑performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

We're looking for a distributed ML infrastructure engineer to help extend and scale our training systems. You’ll work side‑by‑side with world‑class researchers and engineers to:
  • Extend distributed training frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)
  • Implement distributed optimizers from mathematical specs
  • Build robust config + launch systems across multi‑node, multi‑GPU clusters
  • Own experiment tracking, metrics logging, and job monitoring for external visibility
  • Improve training system reliability, maintainability, and performance

While much of the work will support large‑scale pre‑training, pre‑training experience is not required. Strong infrastructure and systems experience is what we value most.

Key Responsibilities
  • Distributed Framework Ownership – Extend or modify training frameworks (e.g., DeepSpeed, FSDP) to support new use cases and architectures.
  • Optimizer Implementation – Translate mathematical optimizer specs into distributed implementations.
  • Launch Config & Debugging – Create and debug multi‑node launch scripts with flexible batch sizes, parallelism strategies, and hardware targets.
  • Metrics & Monitoring – Build systems for experiment tracking, job monitoring, and logging usable by collaborators and researchers.
  • Infra Engineering – Write production‑quality code and tests for ML infra in PyTorch or JAX; ensure reliability and maintainability at scale.

Qualifications

Must-Haves:
  • 5+ years of experience in ML systems, infra, or distributed training
  • Experience modifying distributed ML frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)
  • Strong software engineering fundamentals (Python, systems design, testing)
  • Proven multi‑node experience (e.g., Slurm, Kubernetes, Ray) and debugging skills (e.g., NCCL/GLOO)
  • Ability to implement algorithms across GPUs/nodes based on mathematical specs
  • Experience working on an ML platform/infrastructure, and/or distributed inference optimization team
  • Experience with large‑scale machine learning workloads (strong ML fundamentals)

Nice-to-Haves:
  • Exposure to mixed‑precision training (e.g., bf16, fp8) with accuracy validation
  • Familiarity with performance profiling, kernel fusion, or memory optimization
  • Open‑source contributions or published research (MLSys, ICML, NeurIPS)
  • CUDA or Triton kernel experience
  • Experience with large‑scale pre‑training
  • Experience building custom training pipelines at scale and modifying them for custom needs
  • Deep familiarity with training infrastructure and performance tuning

$150,000 - $450,000 a year

Benefits
  • Comprehensive medical, dental, and vision
  • 401(k) program
  • Generous PTO, sick leave, and holidays
  • Paid parental leave and family‑friendly benefits
  • On‑site amenities and perks: Complimentary lunch, gym access, and a short walk to the Sunnyvale Caltrain station

Vacancy posted 4 days ago
Similar jobs that could be interesting for you
Based on the Machine Learning Infrastructure Engineer vacancy in Sunnyvale, CA
  • $162.5k - $286.4k

     ...Sr. Machine Learning Engineer, ASR Infrastructure and Tools Cambridge, Massachusetts, United States Machine Learning and AI Want to join the team pushing the boundaries of AI and building an intelligent assistant that helps millions of people get things done? Join the... 
    Suggested
    Worldwide
    Relocation

    Apple

    Cupertino, CA
    1 day ago
  • $160k - $200k

     ...fast-growing teams. As a Senior ML Infrastructure Engineer at Plus, you will design scalable architectures...  ...integrated with state-of-the-art deep learning frameworks like PyTorch or TensorFlow....  ...the boundaries of what's possible in machine learning infrastructure and contribute... 
    Suggested

    PlusAI, Inc.

    Santa Clara, CA
    4 days ago
  •  ...tools by working on pioneering technologies to surprise and delight creative pros and enthusiasts alike. As a Machine Learning Infrastructure engineer, you will be working alongside world-class engineers and creatives to help innovate in the creative space in ways... 
    Suggested

    Apple

    Cupertino, CA
    3 days ago
  • $92k - $138k

     ...streaming data, supporting analytics, product intelligence, machine learning pipelines, and business operations. As data volume...  ...ML systems. We’re looking for a Machine Learning Engineer to join our Offline Infrastructure team. This is an ideal role for a recent university... 
    Suggested
    Work at office
    Worldwide
    Relocation package

    Unity

    Mountain View, CA
    3 days ago
  • $170k - $240k

     ...impact delivering-driven expert in ML Training Infrastructure with a strong ability to execute hands-on technical...  ...model development initiatives. As a Senior ML Engineer, you will collaborate closely with machine learning engineers, research scientists, and other partners... 
    Suggested
    Local area
    Remote work
    Work from home
    Relocation
    Relocation package
    Flexible hours

    General Motors

    Mountain View, CA
    4 days ago
  • $19 - $65 per hour

     ...join its fast-growing teams. Ready to get hands-on with real-world, large-scale data challenges? We’re seeking a Machine Learning Infrastructure Engineer Intern to join us in a project that focuses on identifying the bottlenecks and implementing high-performance custom... 
    Hourly pay
    Internship

    PlusAI

    Santa Clara, CA
    4 days ago
  • $185k - $335.3k

     ..., impact-driven expert in ML Training Infrastructure with a demonstrated ability to lead through...  ...development at scale. As a Staff ML Engineer, you will operate as a technical...  ...initiatives, partnering closely with machine learning engineers, research scientists, and platform... 
    Local area
    Remote work
    Work from home
    Relocation
    Relocation package
    Flexible hours

    General Motors

    Sunnyvale, CA
    1 day ago
  • $152k - $241.5k

     ...Join our team of innovative engineers who are building an AI Data Center AIOps platform that turns raw, high-volume telemetry into...  ...detecting anomalies and surfacing insights across massive-scale infrastructure before they impact AI training and inference. The core... 

    NVIDIA

    Santa Clara, CA
    1 day ago
  • $181.1k - $272.1k

     ...Senior Machine Learning Engineer, Search & Knowledge Platforms Santa Clara, California, United States Software and Services Are you passionate...  ...who collaborate closely with product, data science, and infrastructure teams to power and enhance features across Apple products... 
    Relocation

    Apple

    Santa Clara, CA
    23 hours ago
  •  ...X Development, LLC in Mountain View, CA, is looking for a Software Engineer to join their Machine Learning team. You will design and maintain CI/CD pipelines for ML workflows, manage ML model deployments, and collaborate with a multidisciplinary team. The ideal candidate... 
    Flexible hours

    X Development, LLC

    Mountain View, CA
    23 hours ago
  • $212k - $318.4k

     ...Senior Machine Learning Platform Engineer - AI, Search & Knowledge Work Locations (2) Submit Resume Join us in building the AI, Search & Knowledge...  ...-driven experiences to focus on innovation rather than infrastructure complexity. In this role, you'll build and scale... 
    Relocation

    Apple

    Cupertino, CA
    5 days ago
  •  ...Senior Machine Learning Engineer, Search & Knowledge Platforms Are you passionate about search technologies and building knowledge experiences...  ...who collaborate closely with product, data science, and infrastructure teams to power and enhance features across Apple products... 

    Apple

    Santa Clara, CA
    4 days ago
  • $147.4k - $272.1k

     ...Machine Learning Engineer, Apple Search & Knowledge Platforms Work Locations (2) Submit Resume The Apple Knowledge Quality Team is building the next-generation of machine learning solutions for Knowledge Q&A at Apple and help power features including Siri and Spotlight... 
    Work experience placement
    Relocation

    Apple

    Santa Clara, CA
    1 day ago
  • $181.1k - $318.4k

     ...Senior Machine Learning Engineer, Apple Search & Knowledge Platforms Work Locations (2) Submit Resume The AI, Search & Knowledge Platforms...  ...iMessage, and operates the foundational platforms and infrastructure that keep these intelligent experiences running at... 
    Relocation

    Apple

    Santa Clara, CA
    5 days ago
  •  ...Expertise in most of the following areas: supervised & unsupervised learning, deep learning, reinforcement learning, federated learning,...  ...Cloud Migration, Custom Software Development, Data Analytics Infrastructure & Cloud Solutions, Cyber Security Services, etc. We make... 

    InterSources

    Santa Clara, CA
    3 days ago
  • $212k - $386.3k

     ...Senior Staff Machine Learning Engineer, Apple Search & Knowledge Platforms Santa Clara, California, United States Machine Learning and AI Apple...  ...iMessage, and operates the foundational platforms and infrastructure that keep these intelligent experiences running at hyperscale... 
    Temporary work
    Worldwide
    Relocation

    Apple

    Santa Clara, CA
    23 hours ago
  • $212k - $386.3k

     ...Sr Staff Machine Learning Engineer, ML Platform Work Locations (2) Submit Resume At Apple, we work every day to create products that...  ...providing a self-serve, unified platform with foundational infrastructure in model training, inference and agentic AI, as well as associated... 
    Immediate start
    Relocation

    Apple

    Cupertino, CA
    2 days ago
  • $153.2k - $234.1k

     ...General Motors. Our team is developing and deploying machine learning solutions that support safe and reliable autonomous...  ...across real-world scenarios. As a Senior ML engineer, you will build critical infrastructure that powers every machine learning engineer working... 
    Remote work
    Relocation package
    Flexible hours

    General Motors

    Mountain View, CA
    1 day ago
  • $181.1k - $272.1k

     ...ML Infrastructure Engineer - Multimodal Training Tools, SIML Work Locations (2) Submit Resume Are you passionate about Generative AI...  ...organization. The team operates at the intersection of multimodal machine learning and system experiences. Our multidisciplinary ML teams... 
    Relocation

    Apple

    Cupertino, CA
    3 days ago
  • $170.7k - $300.2k

     ...A leading technology firm in Cupertino is seeking engineers to develop scalable machine learning approaches for autonomous systems. Candidates should possess a strong background in ML modeling frameworks, GPU computing, and software engineering. Responsibilities include... 

    Career-Mover

    Cupertino, CA
    23 hours ago
  • $181.1k - $318.4k

     ...Apple Inc. is looking for a Senior Machine Learning Engineer for the Siri Speech team in Cupertino, California, to enhance the technology used in speech recognition. The role involves working with large datasets and optimizing data processing to improve model training.... 

    Apple

    Cupertino, CA
    1 day ago
  •  ...time trading, all backed by robust data infrastructure. The Role Arta is building the AI...  ...# System Design Interview with VP of Engineering, 60m # Co-founder Interview with Head...  ..., collaboration, and continuous learning are highly valued ~ The opportunity... 
    Work at office
    Remote work
    Relocation

    Arta Finance

    Mountain View, CA
    3 days ago
  • $230k - $300k

     ...with the right team to fulfill our mission: building the infrastructure layer for content intelligence. If you're inspired to...  ...information, visit About the Role We are seeking a Staff Machine Learning Engineer to provide technical leadership for our recommendation... 
    Full time
    Local area
    Work from home

    NewsBreak

    Mountain View, CA
    2 days ago
  •  ...industrial environments. Our ability to iterate quickly on large-scale models depends on world-class ML infrastructure. We’re looking for a Machine Learning Infrastructure Engineer to build the core systems that enable fast, reliable, and scalable model training—powering... 

    Mind Robotics Inc.

    Palo Alto, CA
    22 hours ago
  •  ...industry experience (including 4+ years in the U.S.) ~ Strong foundation in machine learning, deep learning, and computer vision ~ Experience with distributed systems and scalable ML infrastructure ~ Proficient in Python and software development best practices ~... 

    Saxon Global

    Atherton, CA
    1 day ago
  •  ...Location Palo Alto Employment Type Full time Location Type On-site Department Software Engineering We’re hiring Machine Learning Infrastructure Engineers to build the systems that make large-scale model training actually work. This role is for people who enjoy operating... 
    Full time

    Garuda Ventures

    Palo Alto, CA
    1 day ago
  •  ...efforts. We’re proud to serve as the infrastructure platform for teams developing autonomous...  ...validation of state-of-the-art (SOTA) machine learning models, with a focus on performance,...  ...seeking a Senior ML Infrastructure engineer to help build and scale robust Compute... 
    Local area
    Work from home

    General Motors

    Mountain View, CA
    6 days ago
  • $181.1k - $318.4k

     ...Senior ML Infrastructure Engineer, Proactive The Intelligence Platform team empowers clients across Apple's operating systems with high...  ...state of the art technologies like generative AI, graph machine learning, and private learning to be used by hundreds of teams across... 
    Worldwide
    Relocation

    Apple

    Cupertino, CA
    5 days ago
  • $119.22k - $187.5k

    **The Role:** We are looking for a Software Engineer to join our team and help us scale our platform for performance, reliability,...  ...Ray framework- Experience with Kubernetes at Scale- Experience infrastructure applications or similar experience**Compensation:** The... 
    Relocation package
    Flexible hours

    General Motors

    Sunnyvale, CA
    23 hours ago
  •  ...autonomous vehicle (AV) efforts. We provide an infrastructure platform for teams developing...  ...validation of state‑of‑the‑art (SOTA) machine learning models with an emphasis on performance...  ...are seeking a Senior ML Infrastructure Engineer to build and scale robust compute... 
    Local area
    Work from home

    General Motors

    Sunnyvale, CA
    23 hours ago
