Staff ML Infrastructure Engineer: Scale Training & Inference
$300k - $430k
Decagon
About Decagon

Decagon is the leading conversational AI platform empowering every brand to deliver concierge customer experiences. Our technology enables industry-defining enterprises like Avis Budget Group, Block’s Cash App and Square, Chime, Oura Health, and Hunter Douglas to deploy AI agents that power personalized, deeply satisfying interactions across voice, chat, email, SMS, and every other channel.

We’re building a future where customer experiences are being redefined from support tickets and hold music to faster resolutions, richer conversations, and deeper relationships. We’re proud to be backed by world-class investors who share that vision, including a16z, Accel, Bain Capital Ventures, Coatue, and Index Ventures, along with many others.

We’re an in-office company, driven by a shared commitment to excellence and velocity. Our values (Just Get It Done, Invent What Customers Want, Winner’s Mindset, and The Polymath Principle) shape how we work and grow as a team.

About the Team

The ML Infrastructure team builds the systems that power every stage of Decagon's model lifecycle. We own the platforms for model training, the infrastructure for model evaluation and experimentation, and the routing layer that manages inference across multiple providers. We work at the intersection of research and production: translating cutting-edge ML techniques into reliable, scalable systems that run in customer environments.

We collaborate closely with Research, Infrastructure, and Product teams to ensure models train efficiently, serve reliably, and deliver exceptional user experiences. The team values technical rigor, pragmatic decision-making, and building systems that others love to use.

About the Role

We're hiring a Staff ML Infrastructure Engineer to own the platforms powering Decagon's model training and inference. You'll build distributed training systems, design inference architecture across multiple providers, and create the frameworks that let our Research and Product teams ship faster. This role is for someone who thrives on technical depth, can lead multi-quarter initiatives, and wants to shape the long-term architecture of our ML stack.

In this role, you will

- Design and build distributed training platforms for LLM and multimodal fine-tuning and post-training at scale
- Implement and integrate state-of-the-art training algorithms into production pipelines
- Own inference architecture and multi-provider routing, including failover and optimization
- Research and implement inference optimizations including quantization, speculative decoding, and batching strategies
- Lead initiatives to improve latency and cost efficiency across the training and serving stack
- Build evaluation and experimentation infrastructure that enables rapid, reliable iteration
- Drive technical direction, mentor engineers, and establish best practices for ML infrastructure

Your background looks something like this

- 8+ years building ML infrastructure or production systems at scale
- Deep experience with distributed training: multi-node GPU clusters, fault tolerance, and optimization
- Strong understanding of LLM inference: latency optimization, provider tradeoffs, and serving architecture
- Proficiency in Python and modern ML frameworks (PyTorch, JAX, or TensorFlow)
- Proven track record leading complex, multi-quarter technical projects

Benefits

- Medical, dental, and vision benefits
- Take what you need vacation policy
- Daily lunches, dinners and snacks in the office to keep you at your best

Compensation

$300K – $430K + Offers Equity

