AI Infrastructure — Training Engineer (Large Model) [33251]
Stealth Startup
Responsibilities
- Distributed training framework optimization. Own the R&D and tuning of distributed training frameworks for large models (LLMs, multimodal), resolving scalability bottlenecks on clusters of 10k–100k GPUs.
- Kernel & performance tuning. Work close to the hardware (NVIDIA GPUs / NPUs) on kernel acceleration, memory optimization, and communication optimization (tensor parallelism, pipeline parallelism, ZeRO, and related techniques).
- System resilience & scheduling. Build stable large-scale training clusters; design high-availability fault-tolerance mechanisms (checkpoint/resume, automatic recovery) and compute scheduling strategies to raise overall cluster throughput and resource utilization.
- Training pipeline engineering. Build an end-to-end MLOps platform spanning data preprocessing, distributed training, model fine-tuning (RLHF / DPO, etc.), and automated evaluation.
Qualifications
- Education. Bachelor's degree or above in Computer Science, Software Engineering, Electrical Engineering, or a related field.
- Programming. Excellent hands-on engineering skills; proficient in C/C++ and Python, with a solid foundation in data structures and algorithms.
- Distributed & parallel computing. Hands-on mastery of mainstream distributed training frameworks such as PyTorch, Megatron-LM, DeepSpeed, DeepSpeed-Chat, or Horovod.
- Low-level systems & communication. Familiar with Linux internals, the network stack (RoCE/RDMA), GPU communication primitives (e.g., NCCL), and common storage systems.
- Tuning & debugging. Skilled with profiling and debugging tools such as Nsight, GDB, and PyTorch Profiler; able to quickly diagnose cluster deadlocks, performance bottlenecks, and out-of-memory (OOM) issues.
Vacancy posted 4 days ago
Location: Menlo Park, CA

