AI Infrastructure — Training Engineer (Large Model) [33251]

Stealth Startup

Responsibilities

  • Distributed training framework optimization. Own the R&D and tuning of distributed training frameworks for large models (LLMs, multimodal), resolving scalability bottlenecks on clusters of 10k–100k GPUs.
  • Kernel & performance tuning. Work close to the underlying hardware (NVIDIA GPUs / NPUs) on kernel acceleration, memory optimization, and communication optimization for parallel training (tensor parallelism, pipeline parallelism, ZeRO, and related techniques).
  • System resilience & scheduling. Build stable large-scale training clusters; design high-availability fault-tolerance mechanisms (checkpoint/resume, automatic recovery) and compute scheduling strategies to raise overall cluster throughput and resource utilization.
  • Training pipeline engineering. Build an end-to-end MLOps platform spanning data preprocessing, distributed training, model fine-tuning (RLHF / DPO, etc.), and automated evaluation.

Qualifications

  • Education. Bachelor's degree or above in Computer Science, Software Engineering, Electrical Engineering, or a related field.
  • Programming. Strong hands-on engineering skills; proficient in C/C++ and Python, with a solid foundation in data structures and algorithms.
  • Distributed & parallel computing. Hands-on mastery of mainstream distributed training frameworks such as PyTorch, Megatron-LM, DeepSpeed, DeepSpeed-Chat, or Horovod.
  • Low-level systems & communication. Familiar with Linux internals, the network stack (RoCE/RDMA), GPU communication primitives (e.g., NCCL), and common storage systems.
  • Tuning & debugging. Skilled with profiling and debugging tools such as Nsight, GDB, and PyTorch Profiler; able to quickly diagnose cluster deadlocks, performance bottlenecks, and out-of-memory (OOM) issues.

Vacancy posted 4 days ago
Similar jobs that could be interesting for you
Based on this vacancy in Menlo Park, CA
  • $179k - $223.8k

     ...seeking Senior Data Engineers to play a pivotal role...  ...the development of our Large Driving Model (LDM). In this role,...  ...of large-scale training sets to improve the robustness...  ...the design of data infrastructure, making informed...  ...understanding of data‑centric AI, dataset curation... 
    Training
    Full time
    Temporary work
    Part time
    Local area
    Shift work

    Rivian

    Palo Alto, CA
    22 hours ago
  • $179k - $223.8k

     ...seeking Senior Data Engineers to play a pivotal role...  ...the development of our Large Driving Model (LDM). In this role,...  ...of large-scale training sets to improve the robustness...  ...the design of data infrastructure, making informed...  ...understanding of data-centric AI, dataset curation... 
    Training
    Full time
    Contract work
    Temporary work
    Part time
    Local area
    Shift work

    Rivian

    Palo Alto, CA
    9 hours ago
  • $181.1k - $318.4k

     ...States Machine Learning and AI Apple is where...  ...Description As a Senior/Staff Engineer on the Foundation Model Compute Infrastructure team, you will lead the...  ...orchestration systems for large‑scale TPU workloads...  ...execution of large‑scale training and inference jobs. This... 
    Training
    Relocation

    Apple Inc.

    Santa Clara, CA
    3 days ago
  • $150k

     ...mission is to create AI systems that can...  ..., and focused on engineering excellence. This organization...  ...the Grok Voice Model team to help build...  .... We own the full training pipeline: massive...  ...and execute large‑scale speech data...  ...and experimentation infrastructure to measure and... 
    Training
    Temporary work

    Pantera Capital

    Palo Alto, CA
    2 days ago
  • $224k - $356.5k

     ...is searching for a senior or principal engineer who specializes in building cutting‑edge infrastructure for large‑scale foundation model training in the Generalist Embodied Agent Research...  ...models, large-scale robot learning, embodied AI, and physics simulation. Our past... 
    Training
    Full time

    NVIDIA Gruppe

    Santa Clara, CA
    3 days ago
  • $175k - $350k

     ...At Inflection AI, our public benefit mission is to harness the...  ...with human-centered AI models that unite emotional intelligence...  ...and perspectives. Platform — large-language models (LLMs) and APIs...  .... About the Role As a Model Training engineer, you will design, build, and... 
    Training

    Inflection AI

    Palo Alto, CA
    4 days ago
  •  ...Technical Staff - Foundation Model Architecture & AI Infrastructure Vinci | Full-Time |...  ...realistic production workloads. Trained on 45TB+ of structured...  ...architecture and systems engineering - not low-level GPU...  ...For Deep experience in: Large-scale foundation model architecture... 
    Training
    Full time
    Remote work

    Vinci4d

    Palo Alto, CA
    9 hours ago
  • $180k

     ...s mission is to create AI systems that can accurately...  ..., and focused on engineering excellence. This organization...  ..., agentic planning, RL training, and world simulation (...  ..., evals, and reward models tailored to image/video...  ...or working with large-scale distributed machine... 
    Training
    Temporary work

    Xai

    Palo Alto, CA
    9 hours ago
  •  ...Staff — Diffusion Model About the Role RadixArk...  ...with strong engineering execution — from designing...  ...algorithms to training and deploying...  ...generation generative AI systems used by...  ...experience training large-scale models on...  ...RadixArk RadixArk is an infrastructure-first AI company... 
    Training
    Flexible hours

    RadixArk

    Palo Alto, CA
    9 hours ago
  • $295.5k - $335.3k

     ...developer of Embodied AI technology. Our...  ...and foundation models enable vehicles to...  ...career! As Principal Engineer for the Model...  ...data ingestion and training to experiment scheduling...  ...of AI research, large-scale distributed...  ...vision, aligning infrastructure and tooling with... 
    Training
    Full time

    Wayve

    Sunnyvale, CA
    3 days ago
  •  ...scale the operator intelligence layer in AI infrastructure. You will design transformer frameworks for large-scale datasets, manage distributed training, and ensure robust production systems....  ...have deep expertise in foundation model architecture and experience with production... 
    Training

    Getvinci

    Palo Alto, CA
    5 days ago
  •  ...At Rhoda AI, we're building the full-stack foundation...  ...to the foundational models and video world models...  ...handling scenarios unseen in training. We work at the intersection of large-scale learning,...  ...looking for an Inference Infrastructure Engineer to help build and operate... 
    Training

    Rhoda AI

    Palo Alto, CA
    4 days ago
  • At Rhoda AI, we're building the full-stack foundation...  ...to the foundational models and video world models...  ...handling scenarios unseen in training. We work at the intersection of large-scale learning,...  ...looking for an Inference Infrastructure Engineer to help build and operate... 
    Training

    Rhoda ai

    Palo Alto, CA
    3 days ago
  • $180k

     ...mission is to create AI systems that can accurately...  ..., and focused on engineering excellence. This...  ...then optimize it to our training models and how we execute customer...  ...build‑out new GPU infrastructure with little to no...  ...designing and operating large scale networks with 5... 
    Training
    Temporary work

    Xai

    Palo Alto, CA
    9 hours ago
  •  ...Job Title: CW Research on Large Vehicle Data Model - Summer Intern (99W210)...  ...including pretraining and post-training, leveraging language...  ...sensors, edge, and cloud infrastructure using standard datasets and...  ...understanding. Software engineering fundamentals, especially... 
    Training
    Summer internship
    Visa sponsorship
    Work visa

    Kyyba

    Mountain View, CA
    1 day ago
  •  ...We are seeking a top-tier AI Scientist / Engineer to join our "Mars-shot" AI...  ...Develop novel approaches and train models for prediction of...  ...Wrangle/stand up/leverage large structured data sets and foundational...  ...domain Oversee data infrastructure across pipeline from software... 
    Training
    Work at office

    Conquest

    Palo Alto, CA
    4 days ago
  • $180k

     ...mission is to create AI systems that can...  ...motivated, and focused on engineering excellence. This...  ...for exceptional ML Infrastructure Engineers with deep...  ...fabric that powers large‑scale AI training and inference clusters...  ...wave of frontier AI models. Annual Base Salary... 
    Training
    Temporary work
    Work at office

    Pantera Capital

    Palo Alto, CA
    4 days ago
  •  ...Machine Learning Infrastructure Engineer At Mind Robotics, we're building generalized physical AI—robotic systems capable of dexterous...  ...to iterate quickly on large-scale models depends on world-class ML...  ...reliable, and scalable model training—powering everything from experimentation... 
    Training

    Mind Robotics

    Palo Alto, CA
    4 days ago
  •  ...in Palo Alto is seeking a Machine Learning Infrastructure Engineer to design and implement scalable systems for training large ML models. The role involves developing and optimizing...  ...chance to work at the forefront of robotic AI technology. #J-18808-Ljbffr Mind Robotics... 
    Training

    Mind Robotics Inc.

    Palo Alto, CA
    2 days ago
  • $254k - $349.25k

     ...how people, data, and AI agents connect...  ...Fortune 100, 10,000 large enterprises, and millions...  ...deep expertise in model architecture, training, fine-tuning, and distillation...  ...teams on ML infrastructure, data pipelines, and...  ..., etc.) Systems & Engineering Experience... 
    Training
    Flexible hours

    Proofpoint

    Sunnyvale, CA
    1 day ago
  •  ...At Rhoda AI, we're building the full-stack foundation...  ...to the foundational models and video world models...  ...handling scenarios unseen in training. We work at the intersection of large-scale learning,...  ...re looking for a Cloud Infrastructure Engineer to build and operate the... 
    Training

    Rhoda AI

    Palo Alto, CA
    4 days ago
  • $150k

     ...Institute of Foundation Models We are a dedicated...  ...generation of AI builders, and...  ...foundation model training, alongside world‑class...  ...data scientists, and engineers, tackling the most...  ...focused on ML infrastructure and MLOps to design...  ...distributed systems for large‑scale data... 
    Training
    Visa sponsorship

    Institute of Foundation Models

    Sunnyvale, CA
    4 days ago
  • $254k - $349.25k

     ...how people, data, and AI agents connect...  ...Fortune 100, 10,000 large enterprises, and millions...  ...deep expertise in model architecture, training, fine‑tuning, and distillation...  ...teams on ML infrastructure, data pipelines, and...  ..., etc.) Systems & Engineering Experience designing... 
    Training
    Flexible hours

    Proofpoint

    Sunnyvale, CA
    3 days ago
  •  ...0x better job search engine: fast, comprehensive,...  ...help us turn powerful AI and ML models into fast, reliable...  ...and real user-facing infrastructure: deploying models, optimizing...  ...integrate researcher-trained model checkpoints...  ...Have experience with large-scale model serving,... 
    Training
    Relocation package

    HiringCafe

    Cupertino, CA
    1 day ago
  • $110k - $153k

     ...Job Category Software Engineering Job Details Salesforce is the #1 AI CRM, where humans with agents drive customer...  ...code and artifacts generated by large language model (LLM) coding assistants and...  ...compensation, promotion, benefits, training, assessment of job performance,... 
    Training
    Internship

    Centaur Labs

    Palo Alto, CA
    9 hours ago
  • $236k - $339.2k

     ...Staff Software Engineer - Container Platform At Snowflake,...  ...usher in this new era, we seek AI-native thinkers across...  .../ML footprint. Hundreds of large Kubernetes clusters under management...  ...Experience with GPU infrastructure or AI/ML training workloads at scale. Open... 
    Training
    Flexible hours

    Streamlit

    Menlo Park, CA
    4 days ago
  • $224k - $356.5k

     ...unlimited potential of AI to define the next era...  ...Deep Learning Engineer — Model Evaluation & AI Systems...  ...reproducible evaluation infrastructure, including harnesses,...  ...pipelines running on large GPU clusters. Collaborate...  ...Work alongside model training, inference, and product... 
    Training

    NVIDIA Gruppe

    Santa Clara, CA
    3 days ago
  • $215.28k - $364.32k

     ...Staff Machine Learning Engineer - Foundation Model Santa Clara, CA XPENG is a leading...  ...innovation, integrating advanced AI and autonomous driving...  ...perception and planning engineers, and infrastructure experts to design, train, and deploy large-scale multi-modal models that... 
    Training
    Full time

    XPENG

    Santa Clara, CA
    1 day ago
  • $262k - $365k

    Software Engineering Manager, 3D World Model, Geospatial AI Google Mountain View, CA, USA Qualifications...  ...distributed computing, large-scale system design,...  ...relevant education or training. Responsibilities Set...  ...requirements and infrastructure needs. Oversee systems... 
    Training

    Google Inc.

    Mountain View, CA
    4 days ago
  • $145.1k - $273.2k

     ...hardware logic of various AI accelerators ;...  ...architectures in the context of Large Language Model (LLM) inference and training. 2.Operator &...  ...technologies within cloud infrastructure. Who We Look For...  ....D. degree in Computer Engineering, Electronic Engineering... 
    Training
    Relocation package

    Tencent

    Palo Alto, CA
    1 day ago
