Distributed Training Engineer

Periodic Labs

Periodic Labs Job Posting

We are an AI + physical sciences lab building state-of-the-art models to make novel scientific discoveries. We are well funded and growing rapidly. Team members are owners who identify and solve problems without boundaries or bureaucracy. We eagerly learn new tools and new science to push forward our mission.

About the Role

You will optimize, operate, and develop large-scale distributed LLM training systems that power AI scientific research. You will work closely with researchers to bring up, debug, and maintain mid-training and reinforcement learning workflows. You will build tools and directly support frontier-scale experiments to make Periodic Labs the world's best AI + science lab for physicists, computational materials scientists, AI researchers, and engineers. You will contribute to open-source large-scale LLM training frameworks.

You might thrive in this role if you have experience with:

  • Training on clusters with ≥5,000 GPUs
  • 5D parallel LLM training
  • Distributed training frameworks such as Megatron-LM, FSDP, DeepSpeed, TorchTitan
  • Optimizing training throughput for large-scale Mixture-of-Experts models
Vacancy posted 4 days ago