Staff ML Platform Engineer - Large Scale Training (LLMOps/MLOps)
TrueFoundry
We're TrueFoundry, and we're building the foundational infrastructure for production AI systems. We're looking for a Staff ML Platform Engineer – Large Scale Training (LLMOps/MLOps) to join the team.
The Problem We're Solving
Companies are moving beyond simple chatbots to production agentic systems. These systems route between models. They integrate dozens of tools via protocols like MCP. They orchestrate multi-agent workflows where agents coordinate with other agents. The infrastructure to support this doesn't exist yet. You need a control plane that handles:
- Intelligent routing with observability, cost policies, and fallback logic
- Centralized tool and MCP server management with security and lifecycle controls
- Agent orchestration with governance and guardrails
- A unified compute layer to run self-hosted models, custom tools, and agents
We've built two products to solve this:
AI Gateway is the control plane: five composable components (Prompts, LLM Gateway, MCP Gateway, Guardrails, Agent Gateway) that handle routing, orchestration, and governance.
AI Deploy is the compute layer: a Kubernetes-based platform that abstracts ML workloads as standard software primitives, so everything runs on unified infrastructure.
We're Series A, backed by Intel Capital and Sequoia. Companies like CVS, Mastercard, Siemens, Paytm, Synopsys, and Zscaler run production AI workloads on our platform.
We're looking for ML Systems Engineers who are passionate about scaling deep learning workloads, optimizing multi-GPU training, and shipping production-grade solutions. If you live and breathe PyTorch, multi-node training, and love solving gnarly infra challenges—this is your place.
What You'll Work On
- Write clean, modular, and scalable Python code, with a strong emphasis on reliability and performance.
- Build a platform for training and fine-tuning large-scale ML models across multi-GPU, multi-node clusters with PyTorch, Kubeflow, and other orchestration tools.
- Own the infrastructure and code that enables high-throughput, low-latency inference pipelines for state-of-the-art models.
- Build a platform for developing, deploying, and evaluating agentic applications for our end customers.
- Help shape internal standards and best practices across the engineering team for high-scale ML workloads.
What We're Looking For
- 5+ years of hands-on experience building and deploying ML systems at scale.
- 5+ years of writing production-quality, high-performance code.
- Deep experience with multi-GPU/multi-node training, ideally with PyTorch as your primary framework.
- Experience working with torch, high-level ML frameworks, and inference engines (vLLM or TensorRT).
- Experience with Kubernetes is highly preferred; exposure to Kubernetes-native tools is a huge plus.
- A pragmatic mindset—you know when to optimize and when to ship.
- Bonus: Familiarity with open-source LLM training/fine-tuning.
Why Join TrueFoundry?
- Work directly with ex-Facebook engineers and with founders who are alumni of IIT Kharagpur, UC Berkeley, and Y Combinator.
- First-hand exposure to building and scaling a deep-tech startup—insights you'll carry if you want to start your own one day.
- Be part of a fearlessly experimental culture focused on customer success and long-term impact.
- Flexible hours, learning credits, and the opportunity to work shoulder-to-shoulder with the co-founders (Abhishek & Nikunj).
$180k - $250k

