Staff, Pre-Training Infra — Distributed ML Training
B Capital
B Capital is seeking a talented engineer in San Francisco to build and scale distributed training systems for machine learning models. The ideal candidate will have strong experience with distributed training frameworks and be able to debug GPU compute systems and optimize training throughput. This role offers top-tier compensation, comprehensive health benefits, and the opportunity to work in a collaborative environment with daily meals and team celebrations.
- A leading technology firm in San Francisco is seeking a candidate to build and scale distributed training systems for large model pre-training. You will collaborate with research teams to design and operate training runs and enhance performance across distributed training...
Training
$150k - $300k
...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and... ...As a Research Engineer working on Distributed Training, you'll play a crucial role... ...with the latest advancements in AI/ML infrastructure and tools, decentralized...
Training, Remote work, Worldwide, Visa sponsorship, Relocation package, Flexible hours
- ...Anthropic and beyond. About the Role: Build and scale distributed training systems that power frontier model pre-training. Work closely with research teams to... ...distributed workloads. Experience working closely with ML researchers to productionize experimental training...
Training, Full time, Relocation package
- Staff Software Engineer, ML Infra & Distributed Systems. About the Role: As a Staff Software Engineer on the ML Infrastructure team, you will collaborate... ...feature stores (e.g. Feast). Understanding of ML model training pipelines and model internals. Experience with...
Training
- ...to build scalable infrastructure for large‑scale training and fine-tuning of foundation models. You will design distributed training systems and optimize GPU utilization... ...Ideal candidates have over 5 years of experience in ML infrastructure and a strong background in...
Training
$212k - $292k
Senior Staff AI Researcher. San Francisco, California, United States... ...engineering, such as AI/ML, algorithms, digital signal... ...vision, data science & analytics, distributed systems, cloud, edge & mobile... ..., and relevant education or training. Your recruiter can share more...
Training, Full time, Local area, Worldwide, Flexible hours
$325k
...good fit if you: Have strong distributed systems, infrastructure, or reliability... ...large-scale model serving or training infrastructure (>1000 GPUs)... ...experience with one or more ML hardware accelerators (GPUs,... ...: Currently, we expect all staff to be in one of our offices at...
Training, Visa sponsorship
- Dormont Manufacturing Co is looking for a Software Engineer for their Pre-training Systems team in San Francisco. Your primary role will be to design and maintain the distributed infrastructure that trains long-context models at scale, tackling challenges related to memory...
Training
- ...firm in San Francisco is seeking an AI Infra Engineer to enhance their infrastructure... ...Kubernetes clusters and manage Slurm for distributed training. Important skills include extensive experience... ...team aiming at advancements in AI and ML infrastructure...
Training
$320k
...high‑performance, large‑scale distributed systems serving millions of users... ...serving; prior inference or ML experience is not required.... ...equivalent combination of education, training, and/or experience. Required... ...: Currently, we expect all staff to be in one of our offices at...
Training, Visa sponsorship
$117.2k - $313.7k
...new and exciting components/frameworks in distributed filesystems in an ever-growing and... ...Design patterns & Experience with Big-Data/ML and S3 Hands-on experience with Streaming... ...assignment, compensation, promotion, benefits, training, assessment of job performance,...
Training, Immediate start, Remote work
$150k - $300k
...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and... ...including SLURM and Kubernetes for distributed workloads. Implement high‑performance... ...FSDP, DeepSpeed, Megatron‑LM) ML framework optimization and profiling...
Training
$181.1k - $318.4k
Apple Inc. is looking for a Staff ML Infrastructure Engineer in San Francisco to lead pre-training initiatives for cutting-edge foundation models in machine learning... ...and Go, and possess strong knowledge of distributed systems and containerization. The role offers...
Training
$281k - $356k
...and work with partners to scale eval and ML development by exploring new methods and delivering... ...hybrid role, you will report to a Senior Staff manager. You will: Develop tools for... ...exact work location, experience, relevant training and education, and skill level. Your...
Training, Full time, Work experience placement, Remote work
- ...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and... ...for synthetic data generation and distributed RL frameworks. Publish research in... ...with the latest advancements in AI/ML infrastructure and tools, synthetic...
Training, Remote work, Worldwide, Visa sponsorship, Relocation package, Flexible hours
- ...pipelines. Design and implement scalable data pipelines for model training, inference, and monitoring, ensuring low latency and high... ...SageMaker). Proven track record building and scaling Kubernetes‑based ML‑ops platforms. Strong knowledge of data engineering, ETL, and...
Training
- ...Staff Software Engineer, Listings & Host Tools and AI. Airbnb was born in 2007 when two... ...standards etc. We own data pipelines and ML models and will build services for serving... ...is dependent upon many factors, such as: training, transferable skills, work experience, business...
Training, Work experience placement
$225k
...alone. Our approach combines frontier-scale pre-training, domain-specific RL, ultra-long context,... ...team, you will design and operate the distributed infrastructure that trains Magic’s long-... ...cross-layer issues in production ML systems. Strong ownership mindset and ability...
Training, Relocation, Visa sponsorship
$192k - $260k
...build the most trusted data analytics and ML platform in the world. We’re looking to... ...language (preferably Python). Experience with distributed data processing systems like Spark and... ...experience, relevant certifications and training, and specific work location. Based on the...
Training, Remote job, Local area, Worldwide
- ...will help scale and optimize our training systems and core model code.... ...role at the intersection of ML, software engineering, and scalable... ..., and metrics/logging. Scale distributed training: Work with... ...Translate research needs into infra capabilities and guide best practices...
Training, Full time
$150k - $300k
...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and... ...meet throughput/latency SLOs. Model Distribution: Optimize model distribution and... ...Requirements. Required Experience: Building ML Systems at Scale: 3+ years building...
Training, Work at office, Remote work, Visa sponsorship, Relocation package, Flexible hours, Shift work
$150k - $300k
...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and... ...workload management. You will work on a distributed system with performance engineering... ...Experience with GPU computing and ML infrastructure. Knowledge of AI/ML...
Training, Work at office, Remote work, Visa sponsorship, Relocation package, Flexible hours
$180k - $350k
...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and... ...: the hosted RL training platform, distributed GPU infrastructure, liquid compute... ...Experience securing GPU infrastructure or ML training pipelines. Background in...
Training, Work at office, Remote work, Visa sponsorship, Relocation package, Flexible hours
- ...formats like PDFs and spreadsheets. We train vision models to read those documents the... ...our core product. The Opportunity: As an ML Infra Engineer, you’ll play a key role in... ...strong reliability and observability. Scale distributed training and inference workloads across...
Training, Work at office
$208k - $276k
...Solutions that address customer needs. Responsibilities: Pre-sales solutions consulting: The role involves... ...AWS. Proven experience designing and implementing AI/ML solutions, including model development, training, and deployment, with a deep understanding of frameworks...
Training, Full time, Contract work
$190k - $250k
...comprehensive platform to manage heart disease. As a Staff Data Architect, you will lead the data... ...systems through curated analytical and ML‑ready datasets. Advance the Semantic... ...Heartflow, including recruitment, hiring, training, relocation, promotion, and termination....
Training, Work experience placement, Local area, Worldwide, Relocation
$320k - $405k
...beneficial AI systems. Staff Infrastructure Engineer, Node Infra. About the role: the company... ...determine how quickly we can train new models, how reliably... ...Deep expertise in distributed systems, reliability, and... ...InfiniBand) for distributed ML workloads. Demonstrated...
Training, Visa sponsorship
$300k
...Key responsibilities: Build and maintain distributed inference systems. Design request... ...an equivalent combination of education, training, and experience. Required field of study... ...position. Location‑based hybrid policy: staff to be in one office at least 25% of the...
Training, Work at office, Worldwide, Visa sponsorship
$320k
...partners. Strong Candidates May Also Have: Experience with ML inference or training infrastructure deployment, particularly across multiple accelerator... ...Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time....
Training, Visa sponsorship, Shift work
- ...ABOUT THE ROLE: You'll build and maintain the ML systems and pipelines that our research runs on top of: data pipelines, training infrastructure, evaluation tooling,... ...evaluation, deployment, monitoring. Strong distributed systems experience; you've shipped systems...
Training

