
Staff, Pre-Training Infra — Distributed ML Training

B Capital

B Capital is seeking a talented engineer in San Francisco to build and scale distributed training systems for machine learning models. The ideal candidate will have strong experience with distributed training frameworks and be able to debug GPU compute systems and optimize training throughput. This role offers top-tier compensation, comprehensive health benefits, and a collaborative environment with daily meals and team celebrations.

Vacancy posted 3 days ago
Similar jobs that could be interesting for you
Based on the Staff, Pre-Training Infra — Distributed ML Training in San Francisco, CA vacancy
  • A leading technology firm in San Francisco is seeking a candidate to build and scale distributed training systems for large model pre-training. You will collaborate with research teams to design and operate training runs and enhance performance across distributed training... 
    Training

    Reflection

    San Francisco, CA
    4 days ago
  •  ...research organization is seeking an expert to build and scale distributed training systems for large machine learning models. The role involves...  ...training frameworks and excellent debugging skills for large-scale ML pipelines. This position offers top-tier compensation and... 
    Training

    AI Chopping Block, Inc.

    San Francisco, CA
    1 day ago
  • $270k - $340k

     ..., with technologies ranging from post‑training open source LLMs to developing advanced...  ...‑art approaches. Optimize end‑to‑end ML systems for distributed training and RL, memory efficiency,...  ...Collaborate with distributed systems and infra teams to push the limits of... 
    Training
    Local area
    Worldwide

    I did my part and supported the Regular Toilet

    San Francisco, CA
    4 days ago
  • $150k - $300k

     ...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and...  ...As a Research Engineer working on Distributed Training, you'll play a crucial role...  ...with the latest advancements in AI/ML infrastructure and tools, decentralized... 
    Training
    Remote work
    Worldwide
    Visa sponsorship
    Relocation package
    Flexible hours

    Prime-Intellect

    San Francisco, CA
    2 days ago
  •  ...Anthropic and beyond. About the Role Build and scale distributed training systems that power frontier model pre-training. Work closely with research teams to...  ...distributed workloads. Experience working closely with ML researchers to productionize experimental training... 
    Training
    Relocation package

    Reflection

    San Francisco, CA
    4 days ago
  • $227.2k - $417k

     ...Software Engineer, ML Infra & Distributed Systems (Staff & Principal) San Francisco, CA; Los Angeles, CA; New York, NY (Hybrid); USA - Remote...  ...FAISS), feature stores (e.g. Feast), ElastiCache, model training orchestration, etc. Understanding of ML model training... 
    Training
    Full time
    Temporary work
    Local area
    Remote work
    Flexible hours

    Tubi

    San Francisco, CA
    3 days ago
  •  ...to build scalable infrastructure for large‑scale training and fine-tuning of foundation models. You will design distributed training systems and optimize GPU utilization...  ...Ideal candidates have over 5 years of experience in ML infrastructure and a strong background in... 
    Training

    Baseten

    San Francisco, CA
    1 day ago
  •  ...intelligence to serve humanity. We’re training and deploying frontier models for...  ...typical "Applied Scientist" or "ML Engineer" role. As a Member of Technical Staff, Applied ML, you will: Work...  ...Impact Mentor engineers across distributed teams. Drive clarity in ambiguous... 
    Training
    Full time
    Work at office
    Remote work
    Flexible hours

    Cohere

    San Francisco, CA
    3 days ago
  • ML Systems Engineer - Robotics & AI We are building the full-stack...  ...handling scenarios unseen in training. We work at the intersection...  ...manufacturing scale-up. We are hiring a Staff/Principal ML Systems Engineer...  .... Drive measurable gains in distributed efficiency, compute efficiency... 
    Training

    Maxwell Bond

    San Francisco, CA
    3 days ago
  • $325k

     ...fit if you: Have strong distributed systems, infrastructure, or...  ...large-scale model serving or training infrastructure (~1000 GPUs)...  ...Have experience with one or more ML hardware accelerators (GPUs, TPUs...  ...: Currently, we expect all staff to be in one of our offices at... 
    Training
    Work at office
    Visa sponsorship
    Flexible hours

    Anthropic

    San Francisco, CA
    3 days ago
  • $117.2k - $313.7k

     ...new and exciting components/frameworks in distributed filesystems in an ever-growing and...  ...Design patterns & Experience with Big-Data/ML and S3 Hands-on experience with Streaming...  ...assignment, compensation, promotion, benefits, training, assessment of job performance,... 
    Training
    Immediate start
    Remote work

    Salesforce

    San Francisco, CA
    4 days ago
  •  ...firm in San Francisco is seeking an AI Infra Engineer to enhance their infrastructure...  ...Kubernetes clusters and manage Slurm for distributed training. Important skills include extensive experience...  ...team aiming at advancements in AI and ML infrastructure. 
    Training

    Perplexity

    San Francisco, CA
    2 days ago
  • $200k - $280k

     ...architectures, engines) and post-training / RL systems. We build...  ...), GPU performance, distributed serving. RL-first...  ...computing for ML. Are comfortable...  ...enjoy collaborating with infra, research, and product...  ...technical leadership (Staff level) Set technical... 
    Training
    Full time

    Together AI

    San Francisco, CA
    4 days ago
  •  ...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and...  ...for synthetic data generation and distributed RL frameworks. Publish research in...  ...with the latest advancements in AI/ML infrastructure and tools, synthetic... 
    Training
    Remote work
    Worldwide
    Visa sponsorship
    Relocation package
    Flexible hours

    Prime-Intellect

    San Francisco, CA
    3 days ago
  • $181.1k - $318.4k

    Apple Inc. is looking for a Staff ML Infrastructure Engineer in San Francisco to lead pre-training initiatives for cutting-edge foundation models in machine learning...  ...and Go, and possess strong knowledge of distributed systems and containerization. The role offers... 
    Training

    Apple Inc.

    San Francisco, CA
    3 days ago
  •  ...ML Infrastructure Engineer In this role you will help scale and optimize our training systems and core model code. You'll own critical infrastructure...  ...metrics/logging. Scale distributed training: Work with...  ...Translate research needs into infra capabilities and guide best... 
    Training

    Physical Intelligence

    San Francisco, CA
    15 hours ago
  • $233.5k - $350.5k

     ...GoFundMe team is searching for our next Senior Staff Data Platform Architect to help design...  ...decision-making, experimentation, and AI/ML across GoFundMe. This is a highly...  ...including skills, experience, education, or training. Your recruiter can share more about the specific... 
    Training
    Full time
    Work at office
    Flexible hours

    GoFundMe

    San Francisco, CA
    15 hours ago
  • $190k - $250k

     ...comprehensive platform to manage heart disease. As a Staff Data Architect, you will lead the data...  ...systems through curated analytical and ML-ready datasets. Advance the Semantic...  ...Heartflow, including recruitment, hiring, training, relocation, promotion, and termination.... 
    Training
    Work experience placement
    Local area
    Worldwide
    Relocation

    HeartFlow

    San Francisco, CA
    1 day ago
  • $150k - $300k

     ...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and...  ...meet throughput/latency SLOs. Model Distribution: Optimize model distribution and...  ...Requirements Required Experience Building ML Systems at Scale: 3+ years building... 
    Training
    Work at office
    Remote work
    Visa sponsorship
    Relocation package
    Flexible hours
    Shift work

    Prime Intellect

    San Francisco, CA
    1 day ago
  •  ...something! Description As an engineer on ML Compute team, your work will include: Drive large-scale pre-training initiatives to support cutting-edge...  ...scalability, and resource optimization. Enhance distributed training techniques for foundation models.... 
    Training

    Apple

    San Francisco, CA
    1 day ago
  • $180k - $350k

     ...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and...  ...: the hosted RL training platform, distributed GPU infrastructure, liquid compute...  ...Experience securing GPU infrastructure or ML training pipelines Background in... 
    Training
    Work at office
    Remote work
    Visa sponsorship
    Relocation package
    Flexible hours

    Prime Intellect, Inc.

    San Francisco, CA
    4 days ago
  • $150k - $300k

     ...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and...  ...workload management You will work on a distributed system with performance engineering...  ...Experience with GPU computing and ML infrastructure Knowledge of AI/ML... 
    Training
    Full time
    Work at office
    Remote work
    Visa sponsorship
    Relocation package
    Flexible hours

    Kubelt

    San Francisco, CA
    2 days ago
  • $225k

     ...alone. Our approach combines frontier-scale pre-training, domain-specific RL, ultra-long context,...  ...team, you will design and operate the distributed infrastructure that trains Magic’s long-...  ...cross-layer issues in production ML systems Strong ownership mindset and ability... 
    Training
    Relocation
    Visa sponsorship

    Magic

    San Francisco, CA
    2 days ago
  • $320k

     ...Strong Candidates May Also Have Experience with ML inference or training infrastructure deployment, particularly across multiple accelerator...  ...Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time.... 
    Training
    Work at office
    Visa sponsorship
    Flexible hours
    Shift work

    Anthropic

    San Francisco, CA
    1 day ago
  • $208k - $276k

    Staff Solutions Architect, Data Solutions...  ...needs. Responsibilities: Pre-sales solutions consulting: The...  ...experience designing and implementing AI/ML solutions, including model development, training, and deployment, with a deep... 
    Training
    Full time
    Contract work

    Verily

    San Bruno, CA
    2 days ago
  •  ...purpose AI for the physical world. Training our models requires orchestrating...  ..., and making large-scale distributed training seamless. The Team The ML Infrastructure team supports and...  ...fast. You will work closely with ML Infra (training systems), data platform... 
    Training
    Flexible hours

    Physical Intelligence

    San Francisco, CA
    15 hours ago
  •  ...formats like PDFs and spreadsheets. We train vision models to read those documents the...  ...product. The Opportunity As an ML Infra Engineer, you'll play a key role in...  ...reliability and observability. Scale distributed training and inference workloads across... 
    Training
    Work at office
    Local area

    Reducto

    San Francisco, CA
    2 days ago
  • $320k - $405k

     ...beneficial AI systems. Staff Infrastructure Engineer, Node Infra About the role...  ...determine how quickly we can train new models, how reliably...  ...Deep expertise in distributed systems, reliability, and...  ...InfiniBand) for distributed ML workloads. ~ Demonstrated... 
    Training
    Work at office
    Visa sponsorship
    Flexible hours

    Anthropic

    San Francisco, CA
    9 days ago
  • A leading AI research company in San Francisco seeks Senior/Staff Engineers skilled in distributed systems and large-scale ML training. Responsibilities include designing systems optimized for low-bandwidth conditions and implementing robust training strategies. Ideal... 
    Training
    Remote work

    Pluralis Research

    San Francisco, CA
    1 day ago
  • $160k - $200k

     ...superintelligence stack - from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and orchestrate global...  ...and discussing cluster performance with a customer's ML infrastructure team in the morning, and partnering with... 
    Training
    Remote work
    Visa sponsorship
    Relocation package
    Flexible hours
    Day shift

    Prime Intellect

    San Francisco, CA
    2 days ago
