Distributed Training Engineer

Periodic Labs

Periodic Labs Job Posting

We are an AI + physical sciences lab building state-of-the-art models to make novel scientific discoveries. We are well funded and growing rapidly. Team members are owners who identify and solve problems without boundaries or bureaucracy. We eagerly learn new tools and new science to push forward our mission.

About the Role

You will optimize, operate, and develop large-scale distributed LLM training systems that power AI scientific research. You will work closely with researchers to bring up, debug, and maintain mid-training and reinforcement learning workflows. You will build tools and directly support frontier-scale experiments to make Periodic Labs the world's best AI + science lab for physicists, computational materials scientists, AI researchers, and engineers. You will contribute to open-source large-scale LLM training frameworks.

You might thrive in this role if you have experience with:

  • Training on clusters with ≥5,000 GPUs
  • 5D parallel LLM training
  • Distributed training frameworks such as Megatron-LM, FSDP, DeepSpeed, TorchTitan
  • Optimizing training throughput for large-scale Mixture-of-Experts models
Vacancy posted 4 days ago
