Staff, Pre-Training Infra — Distributed ML Training
B Capital
B Capital is seeking a talented engineer in San Francisco to build and scale distributed training systems for machine learning models. The ideal candidate will have strong experience with distributed training frameworks and be able to debug GPU compute systems and optimize training throughput. This role offers top-tier compensation, comprehensive health benefits, and the opportunity to work in a collaborative environment with daily meals and team celebrations.
- A leading technology firm in San Francisco is seeking a candidate to build and scale distributed training systems for large model pre-training. You will collaborate with research teams to design and operate training runs and enhance performance across distributed training...
  Training
- ...research organization is seeking an expert to build and scale distributed training systems for large machine learning models. The role involves... ...training frameworks and excellent debugging skills for large-scale ML pipelines. This position offers top-tier compensation and...
  Training
$270k - $340k
..., with technologies ranging from post‑training open source LLMs to developing advanced... ...‑art approaches. Optimize end‑to‑end ML systems for distributed training and RL, memory efficiency,... ...Collaborate with distributed systems and infra teams to push the limits of...
Training, Local area, Worldwide
$150k - $300k
...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and... ...As a Research Engineer working on Distributed Training, you'll play a crucial role... ...with the latest advancements in AI/ML infrastructure and tools, decentralized...
Training, Remote work, Worldwide, Visa sponsorship, Relocation package, Flexible hours
- ...Anthropic and beyond. About the Role Build and scale distributed training systems that power frontier model pre-training. Work closely with research teams to... ...distributed workloads. Experience working closely with ML researchers to productionize experimental training...
  Training, Relocation package
$227.2k - $417k
...Software Engineer, ML Infra & Distributed Systems (Staff & Principal) San Francisco, CA; Los Angeles, CA; New York, NY (Hybrid); USA - Remote... ...FAISS), feature stores (e.g. Feast), ElastiCache, model training orchestration, etc. Understanding of ML model training...
Training, Full time, Temporary work, Local area, Remote work, Flexible hours
- ...to build scalable infrastructure for large‑scale training and fine-tuning of foundation models. You will design distributed training systems and optimize GPU utilization... ...Ideal candidates have over 5 years of experience in ML infrastructure and a strong background in...
  Training
- ...intelligence to serve humanity. We’re training and deploying frontier models for... ...typical "Applied Scientist" or "ML Engineer" role. As a Member of Technical Staff, Applied ML, you will: Work... ...Impact Mentor engineers across distributed teams. Drive clarity in ambiguous...
  Training, Full time, Work at office, Remote work, Flexible hours
- ML Systems Engineer - Robotics & AI We are building the full-stack... ...handling scenarios unseen in training. We work at the intersection... ...manufacturing scale-up. We are hiring a Staff/Principal ML Systems Engineer... .... Drive measurable gains in distributed efficiency, compute efficiency...
  Training
$325k
...fit if you: Have strong distributed systems, infrastructure, or... ...large-scale model serving or training infrastructure (~1000 GPUs)... ...Have experience with one or more ML hardware accelerators (GPUs, TPUs... ...: Currently, we expect all staff to be in one of our offices at...
Training, Work at office, Visa sponsorship, Flexible hours
$117.2k - $313.7k
...new and exciting components/frameworks in distributed filesystems in an ever-growing and... ...Design patterns & Experience with Big-Data/ML and S3 Hands-on experience with Streaming... ...assignment, compensation, promotion, benefits, training, assessment of job performance,...
Training, Immediate start, Remote work
- ...firm in San Francisco is seeking an AI Infra Engineer to enhance their infrastructure... ...Kubernetes clusters and manage Slurm for distributed training. Important skills include extensive experience... ...team aiming at advancements in AI and ML infrastructure...
  Training
$200k - $280k
...architectures, engines) and post-training / RL systems. We build... ...), GPU performance, distributed serving. RL-first... ...computing for ML. Are comfortable... ...enjoy collaborating with infra, research, and product... ...technical leadership (Staff level) Set technical...
Training, Full time
- ...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and... ...for synthetic data generation and distributed RL frameworks. Publish research in... ...with the latest advancements in AI/ML infrastructure and tools, synthetic...
  Training, Remote work, Worldwide, Visa sponsorship, Relocation package, Flexible hours
$181.1k - $318.4k
Apple Inc. is looking for a Staff ML Infrastructure Engineer in San Francisco to lead pre-training initiatives for cutting-edge foundation models in machine learning... ...and Go, and possess strong knowledge of distributed systems and containerization. The role offers...
Training
- ...ML Infrastructure Engineer In this role you will help scale and optimize our training systems and core model code. You'll own critical infrastructure... ...metrics/logging. Scale distributed training: Work with... ...Translate research needs into infra capabilities and guide best...
  Training
$233.5k - $350.5k
...GoFundMe team is searching for our next Senior Staff Data Platform Architect to help design... ...decision-making, experimentation, and AI/ML across GoFundMe. This is a highly... ...including skills, experience, education, or training. Your recruiter can share more about the specific...
Training, Full time, Work at office, Flexible hours
$190k - $250k
...comprehensive platform to manage heart disease. As a Staff Data Architect, you will lead the data... ...systems through curated analytical and ML-ready datasets. Advance the Semantic... ...Heartflow, including recruitment, hiring, training, relocation, promotion, and termination....
Training, Work experience placement, Local area, Worldwide, Relocation
$150k - $300k
...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and... ...meet throughput/latency SLOs. Model Distribution: Optimize model distribution and... ...Requirements Required Experience Building ML Systems at Scale: 3+ years building...
Training, Work at office, Remote work, Visa sponsorship, Relocation package, Flexible hours, Shift work
- ...something! Description As an engineer on ML Compute team, your work will include: Drive large-scale pre-training initiatives to support cutting-edge... ...scalability, and resource optimization. Enhance distributed training techniques for foundation models....
  Training
$180k - $350k
...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and... ...: the hosted RL training platform, distributed GPU infrastructure, liquid compute... ...Experience securing GPU infrastructure or ML training pipelines Background in...
Training, Work at office, Remote work, Visa sponsorship, Relocation package, Flexible hours
$150k - $300k
...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and... ...workload management You will work on a distributed system with performance engineering... ...Experience with GPU computing and ML infrastructure Knowledge of AI/ML...
Training, Full time, Work at office, Remote work, Visa sponsorship, Relocation package, Flexible hours
$225k
...alone. Our approach combines frontier-scale pre-training, domain-specific RL, ultra-long context,... ...team, you will design and operate the distributed infrastructure that trains Magic’s long-... ...cross-layer issues in production ML systems Strong ownership mindset and ability...
Training, Relocation, Visa sponsorship
$320k
...Strong Candidates May Also Have Experience with ML inference or training infrastructure deployment, particularly across multiple accelerator... ...Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time....
Training, Work at office, Visa sponsorship, Flexible hours, Shift work
$208k - $276k
Staff Solutions Architect, Data Solutions. Staff Solutions... ...needs. Responsibilities: Pre-sales solutions consulting: The... ...experience designing and implementing AI/ML solutions, including model development, training, and deployment, with a deep...
Training, Full time, Contract work
- ...purpose AI for the physical world. Training our models requires orchestrating... ..., and making large-scale distributed training seamless. The Team The ML Infrastructure team supports and... ...fast. You will work closely with ML Infra (training systems), data platform...
  Training, Flexible hours
- ...formats like PDFs and spreadsheets. We train vision models to read those documents the... ...product. The Opportunity As an ML Infra Engineer, you'll play a key role in... ...reliability and observability. Scale distributed training and inference workloads across...
  Training, Work at office, Local area
$320k - $405k
...beneficial AI systems. Staff Infrastructure Engineer, Node Infra About the role... ...determine how quickly we can train new models, how reliably... ...Deep expertise in distributed systems, reliability, and... ...InfiniBand) for distributed ML workloads. Demonstrated...
Training, Work at office, Visa sponsorship, Flexible hours
- A leading AI research company in San Francisco seeks Senior/Staff Engineers skilled in distributed systems and large-scale ML training. Responsibilities include designing systems optimized for low-bandwidth conditions and implementing robust training strategies. Ideal...
  Training, Remote work
$160k - $200k
...superintelligence stack - from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and orchestrate global... ...and discussing cluster performance with a customer's ML infrastructure team in the morning, and partnering with...
Training, Remote work, Visa sponsorship, Relocation package, Flexible hours, Day shift

