Senior Distributed Systems Engineer

$200k - $400k

Institute of Foundation Models

About the Institute of Foundation Models

The Institute of Foundation Models (IFM) designs and operates ultra-scale GPU supercomputing systems to train next-generation foundation models. We believe performance, fault tolerance, and scalability are co-designed across model architecture, communication systems, runtime, and hardware topology.

This role sits at the core of that effort - driving communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.

The Mission

We are looking for a deeply technical engineer to co-design and optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads.

This is not a network operations role. This is a systems-level engineering position focused on performance engineering, distributed debugging, and communication-runtime co-design.

• Design and optimize expert-parallel and hybrid-parallel communication patterns

• Drive high-performance hierarchical collectives for MoE workloads

• Co-design runtime orchestration with communication topology awareness

• Reduce tail latency and improve determinism across thousands of GPUs

• Architect fault-tolerant distributed execution under real-world cluster failures

Core Technical Scope

• Communication-compute overlap and topology-aware collective optimization

• Deep debugging of NCCL, RDMA, and custom communication layers

• Hybrid expert parallel strategies in modern large-scale MoE systems

• Elastic and resilient distributed job orchestration concepts

• Congestion analysis and routing optimization across InfiniBand/RoCE fabrics

• Microbenchmarking and performance modeling for communication-heavy workloads

Expected Technical Depth

• Hybrid expert parallel communication for Mixture-of-Experts training

• Scaling behavior under network pressure

• Distributed orchestration for elastic, large-scale training

• Fault detection and recovery in distributed GPU workloads

• Cross-layer bottlenecks: GPU, NIC, PCIe, NVSwitch, fabric, scheduler

Required Background

• Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)

• Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA

• Deep familiarity with NCCL and/or UCX internals

• Strong systems programming ability (C/C++, Rust, or Go)

• Strong familiarity with modern model training frameworks such as PyTorch

• Ability to troubleshoot and profile training performance issues related to communication bottlenecks

• Ability to translate research ideas into production-grade optimizations

• Experience debugging distributed hangs, desynchronization, and performance regressions

What We Mean by "Hardcore"

• You can explain why communication degrades at scale and how to fix it

• You have improved real cluster throughput via communication redesign

• You can trace a distributed hang across ranks and identify the root cause

• You are comfortable working at the boundary between hardware and runtime

Application Requirements

• Include a link to your GitHub (required)

• Provide links to relevant distributed systems, HPC, or large-scale training projects

• Include a list of publications and/or public technical reports (if applicable)

• Describe the hardest distributed debugging problem you solved

• Include measurable performance improvements you have delivered

Academic Qualifications

Master's degree, or Bachelor's degree plus 1 year of relevant experience.

$200,000 - $400,000 a year

Visa Sponsorship

This position is eligible for visa sponsorship.

Benefits Include

• Comprehensive medical, dental, and vision benefits

• Bonus

• 401(k) plan

• Generous paid time off, sick leave, and holidays

• Paid parental leave

• Employee Assistance Program

• Life insurance and disability
Vacancy posted 1 day ago
Location: Sunnyvale, CA