Distributed Training Engineer
Periodic Labs
We are an AI + physical sciences lab building state-of-the-art models to make novel scientific discoveries. We are well funded and growing rapidly. Team members are owners who identify and solve problems without boundaries or bureaucracy. We eagerly learn new tools and new science to push forward our mission.
About the Role
You will optimize, operate, and develop large-scale distributed LLM training systems that power AI scientific research. You will work closely with researchers to bring up, debug, and maintain mid-training and reinforcement learning workflows. You will build tools and directly support frontier-scale experiments to make Periodic Labs the world's best AI + science lab for physicists, computational materials scientists, AI researchers, and engineers. You will contribute to open-source large-scale LLM training frameworks.
You might thrive in this role if you have experience with:
- Training on clusters with ≥5,000 GPUs
- 5D parallel LLM training
- Distributed training frameworks such as Megatron-LM, FSDP, DeepSpeed, and TorchTitan
- Optimizing training throughput for large-scale Mixture-of-Experts models


