Senior Distributed Systems Engineer
$200k - $400k
Institute of Foundation Models
The Institute of Foundation Models (IFM) designs and operates ultra-scale GPU supercomputing systems to train next-generation foundation models. We believe performance, fault tolerance, and scalability are co-designed across model architecture, communication systems, runtime, and hardware topology.
This role sits at the core of that effort — driving communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.
The Mission
We are looking for a deeply technical engineer to co-design and optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads.
This is not a network operations role. This is a systems-level engineering position focused on performance engineering, distributed debugging, and communication-runtime co-design.
· Design and optimize expert-parallel and hybrid-parallel communication patterns
· Drive high-performance hierarchical collectives for MoE workloads
· Co-design runtime orchestration with communication topology awareness
· Reduce tail latency and improve determinism across thousands of GPUs
· Architect fault-tolerant distributed execution under real-world cluster failures
Core Technical Scope
· Communication-compute overlap and topology-aware collective optimization
· Deep debugging of NCCL, RDMA, and custom communication layers
· Hybrid expert parallel strategies in modern large-scale MoE systems
· Elastic and resilient distributed job orchestration concepts
· Congestion analysis and routing optimization across InfiniBand/RoCE fabrics
· Microbenchmarking and performance modeling for communication-heavy workloads
Expected Technical Depth
· Hybrid expert parallel communication for Mixture-of-Experts training
· Scaling behavior under network pressure
· Distributed orchestration for elastic, large-scale training
· Fault detection and recovery in distributed GPU workloads
· Cross-layer bottlenecks: GPU ↔ NIC ↔ PCIe ↔ NVSwitch ↔ Fabric ↔ Scheduler
Required Background
· Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)
· Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA
· Deep familiarity with NCCL and/or UCX internals
· Strong systems programming ability (C/C++, Rust, or Go)
· Strong familiarity with modern model training frameworks such as PyTorch
· Ability to troubleshoot and profile training performance issues related to communication bottlenecks
· Ability to translate research ideas into production-grade optimizations
· Experience debugging distributed hangs, desynchronization, and performance regressions
What We Mean by "Hardcore"
· You can explain why communication degrades at scale and how to fix it
· You have improved real cluster throughput via communication redesign
· You can trace a distributed hang across ranks and identify the root cause
· You are comfortable working at the boundary between hardware and runtime
Application Requirements
· Include a link to your GitHub (required)
· Provide links to relevant distributed systems, HPC, or large-scale training projects
· Include a list of publications and/or public technical reports (if applicable)
· Describe the hardest distributed debugging problem you solved
· Include measurable performance improvements you have delivered
Academic Qualifications
Master's degree, or Bachelor's degree plus 1 year of relevant experience.
$200,000 - $400,000 a year
Visa Sponsorship
This position is eligible for visa sponsorship.
Benefits Include
· Comprehensive medical, dental, and vision benefits
· Bonus
· 401K Plan
· Generous paid time off, sick leave and holidays
· Paid Parental Leave
· Employee Assistance Program
· Life insurance and disability

