Distributed Training Engineer
Periodic Labs
We are an AI + physical sciences lab building state-of-the-art models to make novel scientific discoveries. We are well funded and growing rapidly. Team members are owners who identify and solve problems without boundaries or bureaucracy. We eagerly learn new tools and new science to push forward our mission.
About the Role
You will optimize, operate, and develop large-scale distributed LLM training systems that power AI scientific research. You will work closely with researchers to bring up, debug, and maintain mid-training and reinforcement learning workflows. You will build tools and directly support frontier-scale experiments to make Periodic Labs the world's best AI + science lab for physicists, computational materials scientists, AI researchers, and engineers. You will also contribute to open-source large-scale LLM training frameworks.
You might thrive in this role if you have experience with:
- Training on clusters with ≥5,000 GPUs
- 5D parallel LLM training
- Distributed training frameworks such as Megatron-LM, FSDP, DeepSpeed, TorchTitan
- Optimizing training throughput for large-scale Mixture-of-Experts models



