Senior Distributed Systems Engineer

$200k - $400k

Institute of Foundation Models

The Institute of Foundation Models (IFM) designs and operates ultra-scale GPU supercomputing systems to train next-generation foundation models. We believe performance, fault tolerance, and scalability are co-designed across model architecture, communication systems, runtime, and hardware topology.

This role sits at the core of that effort — driving communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.

The Mission

We are looking for a deeply technical engineer to co-design and optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads.

This is not a network operations role. This is a systems-level engineering position focused on performance engineering, distributed debugging, and communication-runtime co-design.

· Design and optimize expert-parallel and hybrid-parallel communication patterns

· Drive high-performance hierarchical collectives for MoE workloads

· Co-design runtime orchestration with communication topology awareness

· Reduce tail latency and improve determinism across thousands of GPUs

· Architect fault-tolerant distributed execution under real-world cluster failures

Core Technical Scope

· Communication-compute overlap and topology-aware collective optimization

· Deep debugging of NCCL, RDMA, and custom communication layers

· Hybrid expert parallel strategies in modern large-scale MoE systems

· Elastic and resilient distributed job orchestration concepts

· Congestion analysis and routing optimization across InfiniBand/RoCE fabrics

· Microbenchmarking and performance modeling for communication-heavy workloads

Expected Technical Depth

· Hybrid expert parallel communication for Mixture-of-Experts training

· Scaling behavior under network pressure

· Distributed orchestration for elastic, large-scale training

· Fault detection and recovery in distributed GPU workloads

· Cross-layer bottlenecks: GPU ↔ NIC ↔ PCIe ↔ NVSwitch ↔ Fabric ↔ Scheduler

Required Background

· Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)

· Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA

· Deep familiarity with NCCL and/or UCX internals

· Strong systems programming ability (C/C++, Rust, or Go)

· Strong familiarity with modern model training frameworks such as PyTorch

· Ability to troubleshoot and profile training performance issues related to communication bottlenecks

· Ability to translate research ideas into production-grade optimizations

· Experience debugging distributed hangs, desynchronization, and performance regressions

What We Mean by "Hardcore"

· You can explain why communication degrades at scale and how to fix it

· You have improved real cluster throughput via communication redesign

· You can trace a distributed hang across ranks and identify the root cause

· You are comfortable working at the boundary between hardware and runtime

Application Requirements

· Include a link to your GitHub (required)

· Provide links to relevant distributed systems, HPC, or large-scale training projects

· Include a list of publications and/or public technical reports (if applicable)

· Describe the hardest distributed debugging problem you solved

· Include measurable performance improvements you have delivered

Academic Qualifications

Master's, or Bachelor's + 1 year of relevant experience.

$200,000 - $400,000 a year

Visa Sponsorship

This position is eligible for visa sponsorship.

Benefits Include

· Comprehensive medical, dental, and vision benefits

· Bonus

· 401(k) plan

· Generous paid time off, sick leave, and holidays

· Paid parental leave

· Employee Assistance Program

· Life insurance and disability

Vacancy posted 4 days ago
Location: Sunnyvale, CA