Staff ML Platform Engineer - Large Scale Training (LLMOps/MLOps)

TrueFoundry

We're TrueFoundry, and we're building the foundational infrastructure for production AI systems. We're looking for a Staff ML Platform Engineer – Large Scale Training (LLMOps/MLOps) to join the team.

The Problem We're Solving

Companies are moving beyond simple chatbots to production agentic systems. These systems route between models. They integrate dozens of tools via protocols like MCP. They orchestrate multi-agent workflows where agents coordinate with other agents. The infrastructure to support this doesn't exist yet. You need a control plane that handles:

Intelligent routing with observability, cost policies, and fallback logic
Centralized tool and MCP server management with security and lifecycle controls
Agent orchestration with governance and guardrails
A unified compute layer to run self-hosted models, custom tools, and agents

We've built two products to solve this:

AI Gateway is the control plane: five composable components (Prompts, LLM Gateway, MCP Gateway, Guardrails, Agent Gateway) that handle routing, orchestration, and governance.

AI Deploy is the compute layer: a Kubernetes-based platform that abstracts ML workloads as standard software primitives, so everything runs on unified infrastructure.

We're Series A, backed by Intel Capital and Sequoia. Companies like CVS, Mastercard, Siemens, Paytm, Synopsys, and Zscaler run production AI workloads on our platform.

We're looking for ML Systems Engineers who are passionate about scaling deep learning workloads, optimizing multi-GPU training, and shipping production-grade solutions. If you live and breathe PyTorch and multi-node training, and love solving gnarly infra challenges, this is your place.

What You'll Work On

Write clean, modular, and scalable Python code, with a strong emphasis on reliability and performance.
Build a platform for training and fine-tuning large-scale ML models across multi-GPU, multi-node clusters with PyTorch, Kubeflow, and other orchestration tools.
Own the infrastructure and code that enables high-throughput, low-latency inference pipelines for state-of-the-art models.
Build a platform for developing, deploying, and evaluating agentic applications for our end customers.
Help shape internal standards and best practices across the engineering team for high-scale ML workloads.

What We're Looking For

5+ years of hands-on experience building and deploying ML systems at scale.
5+ years of writing production-quality, high-performance code.
Deep experience with multi-GPU/multi-node training, ideally with PyTorch as your primary framework.
Experience working with torch, high-level ML frameworks, and inference engines (vLLM or TensorRT).
Experience with Kubernetes is highly preferred; exposure to Kubernetes-native tools is a huge plus.
A pragmatic mindset—you know when to optimize and when to ship.
Bonus: Familiarity with open-source LLM training/fine-tuning.

Why Join TrueFoundry?

Work directly with ex-Facebook engineers, Y Combinator alumni, and founders from IIT Kharagpur and UC Berkeley.
First-hand exposure to building and scaling a deep-tech startup—insights you'll carry if you want to start your own one day.
Be part of a fearlessly experimental culture focused on customer success and long-term impact.
Flexible hours, learning credits, and the opportunity to work shoulder-to-shoulder with the co-founders (Abhishek & Nikunj).

Vacancy posted 1 day ago
