Staff ML Platform Engineer - Large Scale Training (LLMOps/MLOps)

TrueFoundry

We're TrueFoundry, and we're building the foundational infrastructure for production AI systems. We're looking for a Staff ML Platform Engineer – Large Scale Training (LLMOps/MLOps) to join the team.

The Problem We're Solving

Companies are moving beyond simple chatbots to production agentic systems. These systems route between models. They integrate dozens of tools via protocols like MCP. They orchestrate multi-agent workflows where agents coordinate with other agents. The infrastructure to support this doesn't exist yet. You need a control plane that handles:

  • Intelligent routing with observability, cost policies, and fallback logic
  • Centralized tool and MCP server management with security and lifecycle controls
  • Agent orchestration with governance and guardrails
  • A unified compute layer to run self-hosted models, custom tools, and agents

We've built two products to solve this:

AI Gateway is the control plane, five composable components (Prompts, LLM Gateway, MCP Gateway, Guardrails, Agent Gateway) that handle routing, orchestration, and governance.

AI Deploy is the compute layer, a Kubernetes-based platform that abstracts ML workloads as standard software primitives, so everything runs on unified infrastructure.

We're Series A, backed by Intel Capital and Sequoia. Companies like CVS, Mastercard, Siemens, Paytm, Synopsys, and Zscaler run production AI workloads on our platform.

We're looking for ML Systems Engineers who are passionate about scaling deep learning workloads, optimizing multi-GPU training, and shipping production-grade solutions. If you live and breathe PyTorch and multi-node training, and love solving gnarly infra challenges, this is your place.

What You'll Work On

  • Write clean, modular, and scalable Python code, with a strong emphasis on reliability and performance.
  • Build a platform for training and fine-tuning large-scale ML models across multi-GPU, multi-node clusters with PyTorch, Kubeflow, and other orchestration tools.
  • Own the infrastructure and code that enables high-throughput, low-latency inference pipelines for state-of-the-art models.
  • Build a platform for developing, deploying, and evaluating agentic applications for our end customers.
  • Help shape internal standards and best practices across the engineering team for high-scale ML workloads.

What We're Looking For

  • 5+ years of hands-on experience building and deploying ML systems at scale.
  • 5+ years of writing production-quality, high-performance code.
  • Deep experience with multi-GPU/multi-node training, ideally with PyTorch as your primary framework.
  • Experience working with torch, high-level ML frameworks, and inference engines (vLLM or TensorRT).
  • Experience with Kubernetes is highly preferred; exposure to Kubernetes-native tools is a huge plus.
  • A pragmatic mindset—you know when to optimize and when to ship.
  • Bonus: Familiarity with open-source LLM training/fine-tuning.

Why Join TrueFoundry?

  • Work directly with ex-Facebook engineers and founders who are alumni of IIT Kharagpur, UC Berkeley, and Y Combinator.
  • First-hand exposure to building and scaling a deep-tech startup—insights you'll carry if you want to start your own one day.
  • Be part of a fearlessly experimental culture focused on customer success and long-term impact.
  • Flexible hours, learning credits, and the opportunity to work shoulder-to-shoulder with the co-founders (Abhishek & Nikunj).

Vacancy posted 1 day ago