Staff, Pre-Training Infra — Distributed ML Training

B Capital

B Capital is seeking a talented engineer in San Francisco to build and scale distributed training systems for machine learning models. The ideal candidate will have strong experience with distributed training frameworks and be able to debug GPU compute systems and optimize training throughput. This role offers top-tier compensation, comprehensive health benefits, and the opportunity to work in a collaborative environment with daily meals and team celebrations.

Vacancy posted 4 days ago
Similar jobs that could be interesting for you
Based on the Staff, Pre-Training Infra — Distributed ML Training in San Francisco, CA vacancy
  • A leading technology firm in San Francisco is seeking a candidate to build and scale distributed training systems for large model pre-training. You will collaborate with research teams to design and operate training runs and enhance performance across distributed training... 
    Training

    Reflection

    San Francisco, CA
    5 days ago
  • $150k - $300k

     ...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and...  ...As a Research Engineer working on Distributed Training, you'll play a crucial role...  ...with the latest advancements in AI/ML infrastructure and tools, decentralized... 
    Training
    Remote work
    Worldwide
    Visa sponsorship
    Relocation package
    Flexible hours

    Prime-Intellect

    San Francisco, CA
    3 days ago
  •  ...Anthropic and beyond. About the Role Build and scale distributed training systems that power frontier model pre-training. Work closely with research teams to...  ...distributed workloads. Experience working closely with ML researchers to productionize experimental training... 
    Training
    Full time
    Relocation package

    B Capital

    San Francisco, CA
    5 days ago
  • Staff Software Engineer, ML Infra & Distributed Systems About the Role: As a Staff Software Engineer on the ML Infrastructure team, you will collaborate...  ...feature stores (e.g. Feast) Understanding of ML model training pipelines and model internals. Experience with... 
    Training

    Tubi Tv

    San Francisco, CA
    13 days ago
  •  ...to build scalable infrastructure for large‑scale training and fine-tuning of foundation models. You will design distributed training systems and optimize GPU utilization...  ...Ideal candidates have over 5 years of experience in ML infrastructure and a strong background in... 
    Training

    Baseten

    San Francisco, CA
    2 days ago
  • $212k - $292k

    Senior Staff AI Researcher, San Francisco, California, United States...  ...engineering, such as AI/ML, algorithms, digital signal...  ...vision, data science & analytics, distributed systems, cloud, edge & mobile...  ..., and relevant education or training. Your recruiter can share more... 
    Training
    Full time
    Local area
    Worldwide
    Flexible hours

    Via Licensing Corporation

    San Francisco, CA
    5 days ago
  • $325k

     ...good fit if you: Have strong distributed systems, infrastructure, or reliability...  ...large-scale model serving or training infrastructure (>1000 GPUs)...  ...experience with one or more ML hardware accelerators (GPUs,...  ...: Currently, we expect all staff to be in one of our offices at... 
    Training
    Visa sponsorship

    United States Digital Space LLC

    San Francisco, CA
    2 days ago
  • Dormont Manufacturing Co is looking for a Software Engineer for their Pre-training Systems team in San Francisco. Your primary role will be to design and maintain the distributed infrastructure that trains long-context models at scale, tackling challenges related to memory... 
    Training

    Dormont Manufacturing Co

    San Francisco, CA
    2 days ago
  •  ...firm in San Francisco is seeking an AI Infra Engineer to enhance their infrastructure...  ...Kubernetes clusters and manage Slurm for distributed training. Important skills include extensive experience...  ...team aiming at advancements in AI and ML infrastructure... 
    Training

    Perplexity

    San Francisco, CA
    3 days ago
  • $320k

     ...high‑performance, large‑scale distributed systems serving millions of users...  ...serving; prior inference or ML experience is not required....  ...equivalent combination of education, training, and/or experience. Required...  ...: Currently, we expect all staff to be in one of our offices at... 
    Training
    Visa sponsorship

    United States Digital Space LLC

    San Francisco, CA
    6 days ago
  • $117.2k - $313.7k

     ...new and exciting components/frameworks in distributed filesystems in an ever-growing and...  ...Design patterns & Experience with Big-Data/ML and S3 Hands-on experience with Streaming...  ...assignment, compensation, promotion, benefits, training, assessment of job performance,... 
    Training
    Immediate start
    Remote work

    Salesforce, Inc.

    San Francisco, CA
    5 days ago
  • $150k - $300k

     ...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and...  ...including SLURM and Kubernetes for distributed workloads Implement high‑performance...  ...FSDP, DeepSpeed, Megatron‑LM) ML framework optimization and profiling... 
    Training

    Prime Intellect

    San Francisco, CA
    6 days ago
  • $181.1k - $318.4k

    Apple Inc. is looking for a Staff ML Infrastructure Engineer in San Francisco to lead pre-training initiatives for cutting-edge foundation models in machine learning...  ...and Go, and possess strong knowledge of distributed systems and containerization. The role offers... 
    Training

    Apple

    San Francisco, CA
    4 days ago
  • $281k - $356k

     ...and work with partners to scale eval and ML development by exploring new methods and delivering...  ...hybrid role, you will report to a Senior Staff manager. You will: Develop tools for...  ...exact work location, experience, relevant training and education, and skill level. Your... 
    Training
    Full time
    Work experience placement
    Remote work

    Waymo

    San Francisco, CA
    5 days ago
  •  ...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and...  ...for synthetic data generation and distributed RL frameworks Publish research in...  ...with the latest advancements in AI/ML infrastructure and tools, synthetic... 
    Training
    Remote work
    Worldwide
    Visa sponsorship
    Relocation package
    Flexible hours

    Prime-Intellect

    San Francisco, CA
    2 days ago
  •  ...pipelines. Design and implement scalable data pipelines for model training, inference, and monitoring, ensuring low latency and high...  ...SageMaker). Proven track record building and scaling Kubernetes‑based ML‑ops platforms. Strong knowledge of data engineering, ETL, and... 
    Training

    Gravity Engineering Services Pvt Ltd.

    San Francisco, CA
    5 days ago
  •  ...Staff Software Engineer, Listings & Host Tools and AI Airbnb was born in 2007 when two...  ...standards etc. We own data pipelines and ML models and will build services for serving...  ...is dependent upon many factors, such as: training, transferable skills, work experience, business... 
    Training
    Work experience placement

    airbnb, Inc.

    San Francisco, CA
    6 hours ago
  • $225k

     ...alone. Our approach combines frontier-scale pre-training, domain-specific RL, ultra-long context,...  ...team, you will design and operate the distributed infrastructure that trains Magic’s long-...  ...cross-layer issues in production ML systems Strong ownership mindset and ability... 
    Training
    Relocation
    Visa sponsorship

    Magic

    San Francisco, CA
    3 days ago
  • $192k - $260k

     ...build the most trusted data analytics and ML platform in the world. We’re looking to...  ...language (preferably Python) ~ Experience with distributed data processing systems like Spark and...  ...experience, relevant certifications and training, and specific work location. Based on the... 
    Training
    Remote job
    Local area
    Worldwide

    Databricks

    San Francisco, CA
    more than 2 months ago
  •  ...will help scale and optimize our training systems and core model code....  ...role at the intersection of ML, software engineering, and scalable...  ..., and metrics/logging. Scale distributed training: Work with...  ...Translate research needs into infra capabilities and guide best practices... 
    Training
    Full time

    Monograph

    San Francisco, CA
    4 days ago
  • $150k - $300k

     ...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and...  ...meet throughput/latency SLOs. Model Distribution: Optimize model distribution and...  ...Requirements Required Experience Building ML Systems at Scale: 3+ years building... 
    Training
    Work at office
    Remote work
    Visa sponsorship
    Relocation package
    Flexible hours
    Shift work

    Prime Intellect

    San Francisco, CA
    2 days ago
  • $150k - $300k

     ...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and...  ...workload management You will work on a distributed system with performance engineering...  ...Experience with GPU computing and ML infrastructure Knowledge of AI/ML... 
    Training
    Work at office
    Remote work
    Visa sponsorship
    Relocation package
    Flexible hours

    Prime Intellect, Inc.

    San Francisco, CA
    3 days ago
  • $180k - $350k

     ...from frontier agentic models to the infra that enables anyone to create, train, and deploy them. We aggregate and...  ...: the hosted RL training platform, distributed GPU infrastructure, liquid compute...  ...Experience securing GPU infrastructure or ML training pipelines Background in... 
    Training
    Work at office
    Remote work
    Visa sponsorship
    Relocation package
    Flexible hours

    Prime Intellect, Inc.

    San Francisco, CA
    5 days ago
  •  ...formats like PDFs and spreadsheets. We train vision models to read those documents the...  ...our core product. The Opportunity As an ML Infra Engineer, you’ll play a key role in...  ...strong reliability and observability. Scale distributed training and inference workloads across... 
    Training
    Work at office

    Reducto, Inc.

    San Francisco, CA
    6 days ago
  • $208k - $276k

     ...Solutions that address customer needs. Responsibilities Pre-sales solutions consulting : The role involves...  ...AWS. Proven experience designing and implementing AI/ML solutions, including model development, training, and deployment, with a deep understanding of frameworks... 
    Training
    Full time
    Contract work

    Dormont Manufacturing Co

    San Bruno, CA
    2 days ago
  • $190k - $250k

     ...comprehensive platform to manage heart disease. As a Staff Data Architect, you will lead the data...  ...systems through curated analytical and ML‑ready datasets. Advance the Semantic...  ...Heartflow, including recruitment, hiring, training, relocation, promotion, and termination.... 
    Training
    Work experience placement
    Local area
    Worldwide
    Relocation

    HeartFlow

    San Francisco, CA
    3 days ago
  • $320k - $405k

     ...beneficial AI systems. Staff Infrastructure Engineer, Node Infra About the role the company...  ...determine how quickly we can train new models, how reliably...  ...Deep expertise in distributed systems, reliability, and...  ...InfiniBand) for distributed ML workloads. Demonstrated... 
    Training
    Visa sponsorship

    United States Digital Space LLC

    San Francisco, CA
    2 days ago
  • $300k

     ...Key responsibilities Build and maintain distributed inference systems. Design request...  ...an equivalent combination of education, training, and experience. Required field of study...  ...position. Location‑based hybrid policy: staff to be in one office at least 25% of the... 
    Training
    Work at office
    Worldwide
    Visa sponsorship

    United States Digital Space LLC

    San Francisco, CA
    2 days ago
  • $320k

     ...partners Strong Candidates May Also Have Experience with ML inference or training infrastructure deployment, particularly across multiple accelerator...  ...Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time.... 
    Training
    Visa sponsorship
    Shift work

    United States Digital Space LLC

    San Francisco, CA
    2 days ago
  •  ...ABOUT THE ROLE You'll build and maintain the ML systems and pipelines that our research runs on top of: data pipelines, training infrastructure, evaluation tooling,...  ...evaluation, deployment, monitoring Strong distributed systems experience; you've shipped systems... 
    Training

    MakerMaker.AI

    San Francisco, CA
    4 days ago
