Senior Distributed Systems Engineer
$200k - $400k
Institute of Foundation Models
About the Institute of Foundation Models

The Institute of Foundation Models (IFM) designs and operates ultra-scale GPU supercomputing systems to train next-generation foundation models. We believe performance, fault tolerance, and scalability are co-designed across model architecture, communication systems, runtime, and hardware topology. This role sits at the core of that effort - driving communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.

The Mission

We are looking for a deeply technical engineer to co-design and optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads. This is not a network operations role. This is a systems-level engineering position focused on performance engineering, distributed debugging, and communication-runtime co-design.

• Design and optimize expert-parallel and hybrid-parallel communication patterns
• Drive high-performance hierarchical collectives for MoE workloads
• Co-design runtime orchestration with communication topology awareness
• Reduce tail latency and improve determinism across thousands of GPUs
• Architect fault-tolerant distributed execution under real-world cluster failures

Core Technical Scope

• Communication-compute overlap and topology-aware collective optimization
• Deep debugging of NCCL, RDMA, and custom communication layers
• Hybrid expert parallel strategies in modern large-scale MoE systems
• Elastic and resilient distributed job orchestration concepts
• Congestion analysis and routing optimization across InfiniBand/RoCE fabrics
• Microbenchmarking and performance modeling for communication-heavy workloads

Expected Technical Depth

• Hybrid expert parallel communication for Mixture-of-Experts training
• Scaling behavior under network pressure
• Distributed orchestration for elastic, large-scale training
• Fault detection and recovery in distributed GPU workloads
• Cross-layer bottlenecks: GPU, NIC, PCIe, NVSwitch, fabric, scheduler

Required Background

• Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)
• Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA
• Deep familiarity with NCCL and/or UCX internals
• Strong systems programming ability (C/C++, Rust, or Go)
• Strong familiarity with modern model training frameworks such as PyTorch
• Ability to troubleshoot and profile training performance issues related to communication bottlenecks
• Ability to translate research ideas into production-grade optimizations
• Experience debugging distributed hangs, desynchronization, and performance regressions

What We Mean by "Hardcore"

• You can explain why communication degrades at scale and how to fix it
• You have improved real cluster throughput via communication redesign
• You can trace a distributed hang across ranks and identify the root cause
• You are comfortable working at the boundary between hardware and runtime

Application Requirements

• Include a link to your GitHub (required)
• Provide links to relevant distributed systems, HPC, or large-scale training projects
• Include a list of publications and/or public technical reports (if applicable)
• Describe the hardest distributed debugging problem you solved
• Include measurable performance improvements you have delivered

Academic Qualifications

Master's, or Bachelor's + 1 year of relevant experience.

Salary

$200,000 - $400,000 a year

Visa Sponsorship

This position is eligible for visa sponsorship.

Benefits Include

• Comprehensive medical, dental, and vision benefits
• Bonus
• 401K Plan
• Generous paid time off, sick leave and holidays
• Paid Parental Leave
• Employee Assistance Program
• Life insurance and disability
Vacancy posted 1 day ago

