Staff ML Infrastructure Engineer: Scale Training & Inference

$300k - $430k

Decagon

About Decagon

Decagon is the leading conversational AI platform empowering every brand to deliver concierge customer experiences. Our technology enables industry-defining enterprises like Avis Budget Group, Block’s Cash App and Square, Chime, Oura Health, and Hunter Douglas to deploy AI agents that power personalized, deeply satisfying interactions across voice, chat, email, SMS, and every other channel.

We’re building a future where customer experiences are being redefined from support tickets and hold music to faster resolutions, richer conversations, and deeper relationships. We’re proud to be backed by world-class investors who share that vision, including a16z, Accel, Bain Capital Ventures, Coatue, and Index Ventures, along with many others.

We’re an in-office company, driven by a shared commitment to excellence and velocity. Our values (Just Get It Done, Invent What Customers Want, Winner’s Mindset, and The Polymath Principle) shape how we work and grow as a team.

About the Team

The ML Infrastructure team builds the systems that power every stage of Decagon's model lifecycle. We own the platforms for model training, the infrastructure for model evaluation and experimentation, and the routing layer that manages inference across multiple providers. We work at the intersection of research and production: translating cutting-edge ML techniques into reliable, scalable systems that run in customer environments.

We collaborate closely with Research, Infrastructure, and Product teams to ensure models train efficiently, serve reliably, and deliver exceptional user experiences. The team values technical rigor, pragmatic decision-making, and building systems that others love to use.

About the Role

We're hiring a Staff ML Infrastructure Engineer to own the platforms powering Decagon's model training and inference. You'll build distributed training systems, design inference architecture across multiple providers, and create the frameworks that let our Research and Product teams ship faster. This role is for someone who thrives on technical depth, can lead multi-quarter initiatives, and wants to shape the long-term architecture of our ML stack.

In this role, you will
  • Design and build distributed training platforms for LLM and multimodal fine-tuning and post-training at scale
  • Implement and integrate state-of-the-art training algorithms into production pipelines
  • Own inference architecture and multi-provider routing, including failover and optimization
  • Research and implement inference optimizations including quantization, speculative decoding, and batching strategies
  • Lead initiatives to improve latency and cost efficiency across the training and serving stack
  • Build evaluation and experimentation infrastructure that enables rapid, reliable iteration
  • Drive technical direction, mentor engineers, and establish best practices for ML infrastructure

Your background looks something like this
  • 8+ years building ML infrastructure or production systems at scale
  • Deep experience with distributed training: multi-node GPU clusters, fault tolerance, and optimization
  • Strong understanding of LLM inference: latency optimization, provider tradeoffs, and serving architecture
  • Proficiency in Python and modern ML frameworks (PyTorch, JAX, or TensorFlow)
  • Proven track record leading complex, multi-quarter technical projects

Benefits
  • Medical, dental, and vision benefits
  • Take what you need vacation policy
  • Daily lunches, dinners and snacks in the office to keep you at your best

Compensation
  • $300K – $430K + Offers Equity

Vacancy posted 3 days ago