
Engineer, Inference & Model serving

$220k - $320k

Trades Workforce Solutions

ML Model Serving Engineer

Want to build the layer that actually makes AI usable in real time? You’ll join a team focused on inference, where performance is the product. This is about delivering low-latency, high-throughput systems across LLMs, speech, and vision models running in production, not offline experiments.

They’re building real-time AI systems that need to respond instantly, reliably, and at scale. That means solving hard problems around batching, GPU efficiency, memory constraints, and system-level bottlenecks that most teams never fully crack.

You’ll sit at the core of the platform, working across model serving, infrastructure, and performance optimisation. A big part of the role is pushing current tooling beyond its limits, extending frameworks, profiling bottlenecks, and designing systems that hold up under real-world load. This is not about training models. It’s about making them fast, efficient, and production-ready.

What you’ll work on:
  • Building high-performance serving systems for LLM, speech, and vision models
  • Scaling inference to production workloads with strict latency requirements
  • Optimising GPU utilisation and execution efficiency
  • Implementing techniques like continuous batching, KV cache optimisation, speculative decoding, and prefill/decode separation
  • Improving frameworks such as vLLM, TensorRT-LLM, Triton, and SGLang
  • Profiling and debugging performance across GPU, memory, and system layers

What you’ll bring:
  • Strong experience with ML inference or model serving systems
  • Deep understanding of latency and throughput optimisation in production
  • Solid Python and PyTorch skills, plus a systems or performance engineering mindset
  • Familiarity with distributed systems and production infrastructure

Exposure to CUDA, GPU profiling tools, or systems like Kubernetes and Ray is useful, but the key is knowing how to make models run efficiently at scale.

You’ll join a highly technical team with experience across major AI labs and big tech. The environment is pragmatic, focused on solving real performance problems rather than abstract research. There’s real ownership here. You’ll help define how next-generation AI systems are served.

Package: $220,000 – $320,000 base + equity
San Francisco, onsite 3 days per week

If you’re interested in working on the part of AI that actually determines whether it works in the real world, this is worth exploring. All applicants will receive a response.

Vacancy posted 3 days ago
Similar jobs that could be interesting for you
Based on the Engineer, Inference & Model serving in San Francisco, CA vacancy
  •  ...Model Implementation Engineer Sciforium is an AI infrastructure company developing next-generation multimodal...  ...and a proprietary, high-efficiency serving platform. Backed by multi-million-...  ...with large-scale model training or inference systems. Contributions to open-... 
    Suggested
    Flexible hours

    Sciforium

    San Francisco, CA
    2 days ago
  • A leading AI platform company in San Francisco is seeking a Software Engineer focused on machine learning performance. This role involves implementing advanced techniques for ML model inference and debugging performance issues with frameworks like PyTorch and TensorRT.... 
    Suggested

    Baseten

    San Francisco, CA
    21 hours ago
  •  ...the practice of medicine—and the inference systems that power them need to be...  ...world-class. We’re looking for an Engineering Manager to lead and grow our Model Inference team. The Inference team...  ...technical direction of how our models are served: from architecting low-latency,... 
    Suggested
    Hourly pay
    Full time
    Flexible hours

    AI Chopping Block, Inc.

    San Francisco, CA
    21 hours ago
  •  ...combination of inventive research, design, and engineering. Our organization is very flat, and...  .... About the Role You will lead the Model Routing & Inference team at Cursor, owning the inference...  ...systems, especially in inference serving, traffic routing, or real‑time data... 
    Suggested

    Anysphere

    San Francisco, CA
    3 days ago
  •  ...technology firm in San Francisco is seeking an ML Infrastructure Engineer, Model Inference to build and optimize AI-driven solutions. You will design scalable Kubernetes clusters, enhance ML model serving infrastructure, and collaborate with cross-functional teams. Ideal... 
    Suggested

    Abridge

    San Francisco, CA
    3 days ago
  • A technology startup in San Francisco is seeking a skilled individual to enhance the API infrastructure supporting AI models. The role involves designing and optimizing backend services, focusing on performance and reliability. Candidates should have over 3 years of experience... 

    Baseten

    San Francisco, CA
    21 hours ago
  • $98k - $140k

     ...AI products. You’ll work with product and engineering teams to build systems to define what “...  ...strategy. As part of that you'll shape Notion’s model strategy and work directly with frontier...  ...working with data — You can self‑serve insights from large datasets, whether through... 
    Live in
    Work at office
    Local area

    Notion

    San Francisco, CA
    21 hours ago
  • $325k

    A leading AI research company in San Francisco seeks an engineer to optimize their powerful AI models for high-volume production environments. The ideal candidate has over 5 years of software engineering experience, strong familiarity with ML architectures, and experience... 

    OpenAI

    San Francisco, CA
    21 hours ago
  • $167.2k - $209k

     ...applications. We are seeking a Senior Engineer 2 to join our AI Inference Data Plane team. In this role, you...  ...customers can deploy and scale their models with industry-leading performance...  ...Familiarity with distributed inference serving frameworks such as llm-d, NVIDIA... 
    Local area
    Remote work
    Worldwide
    Flexible hours

    DigitalOcean

    San Francisco, CA
    4 days ago
  • $160k - $230k

     ...LLM Inference Frameworks and Optimization Engineer San Francisco, Singapore, Amsterdam About the Role At...  ...scalable inference for large language models (LLMs). Our mission is to optimize...  ...parallelism for high-performance serving. Apply CUDA graph optimizations... 
    Full time

    Together AI

    San Francisco, CA
    12 days ago
  • Machine Learning Engineer, Inference Want to solve realtime inference problems where milliseconds...  ...batching, and making state-of-the-art speech models actually behave correctly in realtime...  ..., Triton, ONNX Runtime, and custom serving systems Managing KV cache systems, speculative... 
    Remote work
    Flexible hours

    Trades Workforce Solutions

    San Francisco, CA
    3 days ago
  • $180k - $270k

    Plaud is seeking talented individuals for AI infrastructure roles in San Francisco, focusing on building high-performance inference engines for speech AI. Ideal candidates will have substantial experience in GPU architecture and real-time systems. This position offers a... 

    Plaud

    San Francisco, CA
    3 days ago
  •  ...Series A, and is scaling a world-class engineering team across inference, distributed systems, compiler...  ...runtime layer that executes modern models end-to-end under real production constraints...  ...owning inference or model serving infrastructure end‑to‑end Strong understanding... 

    Acceler8 Talent

    San Francisco, CA
    3 days ago
  •  ...YC and unicorn founders and senior engineers with deep expertise in 3D,...  ...looking for a Founding Engineer, ML Inference with deep expertise in high-performance...  ...performance from generative media models. You'll work across the model-serving stack, designing novel inference frameworks... 
    Relocation
    Visa sponsorship
    Relocation package

    Reactor

    San Francisco, CA
    1 day ago
  • Cartesia is looking for an Inference Engineer in San Francisco to enhance real-time multimodal intelligence. You will design and build scalable, low-latency model inference systems while collaborating with researchers. The ideal candidate has strong engineering skills and... 
    Flexible hours

    Cartesia

    San Francisco, CA
    3 days ago
  • $192k - $260k

    A leading data and AI company is seeking a Staff Engineer to design and implement core systems for Foundation Model Serving. The ideal candidate will have over 10 years of experience in building large-scale distributed systems and will collaborate closely across teams... 

    Databricks Inc.

    San Francisco, CA
    21 hours ago
  • $225k

     ...and code generation to improve models and solve alignment more...  ...RL, ultra‑long context, and inference‑time compute to achieve this...  ...About The Role As a Software Engineer on the Inference & RL Systems...  ...the distributed systems that serve our models in production and... 
    Relocation
    Visa sponsorship

    Magic

    San Francisco, CA
    1 day ago
  •  ...scientists, PhDs, creatives, technologists, and engineers working together to empower people and...  ...As an ML Infrastructure Engineer, Model Inference at Abridge, you'll play a pivotal role in...  ...Develop, optimize, and maintain ML model serving infrastructure, ensuring high-performance... 
    Hourly pay
    Full time
    Flexible hours

    Abridge

    San Francisco, CA
    9 days ago
  • A leading data and AI company in San Francisco is seeking a Senior Engineer to enhance their Model Serving platform. This role requires expertise in building large-scale distributed systems and collaboration across teams to optimize performance and reliability. Ideal candidates... 

    Jobleads-US

    San Francisco, CA
    3 days ago
  • Anysphere is looking for an experienced leader for the Model Routing & Inference team in San Francisco. This role involves owning the inference...  ...strong background in high-throughput systems and software engineering fundamentals, combined with leadership skills to mentor... 

    Anysphere

    San Francisco, CA
    3 days ago
  • $217k - $312.2k

     ...to improve their business. Databricks’ Model Serving product provides enterprises with a...  ...models. It offers real‑time, low‑latency inference, governance, monitoring, and lineage. As...  ...SLAs and cost efficiency. As a Senior Engineering Manager, you will lead the team owning... 
    Local area
    Worldwide

    Databricks Inc.

    San Francisco, CA
    4 days ago
  •  ...Responsibilities: Turbocharge our serving layer, consisting of a variety of LLM, speech, and vision models. Partner with ML infrastructure and training engineers to build a fast, cost-effective,...  ...and custom kernels to speed up inference. Find ways to reduce model... 
    Full time
    Contract work
    Flexible hours

    SESAME

    San Francisco, CA
    2 days ago
  • $144k - $164k

     ..., Product Management, Gen AI Model Gateway At Capital One, we’re...  ...experiences. All Generative AI inference traffic at the company will...  ...prototyping, development, and serving). The FM Gateway serves two...  ...analysis, data science, or software engineering. Preferred Qualifications:... 
    Full time
    Part time
    Local area

    Capital One National Association

    San Francisco, CA
    21 hours ago
  •  ...Baseten powers mission‑critical inference for the world’s most dynamic...  ...of AI to bring cutting‑edge models into production. We’re...  ...and help build the platform engineers turn to ship AI products. THE...  ...building or supporting self‑serve workflows. NICE TO HAVE Background... 
    Flexible hours

    Baseten

    San Francisco, CA
    3 days ago
  • $220k - $320k

     ...Help us make inference blazingly fast. If you love squeezing every...  ...and hosts specialized language models for companies that need frontier...  ...-funded ten-person team of engineers who work in-person in...  ...approaches, always with the goal of serving models faster and cheaper at... 
    Work at office

    Inference

    San Francisco, CA
    1 day ago
  •  ...compute into useful intelligence - the inference services that serve LLMs at scale and the data pipelines...  ...about both. Researchers and ML engineers will hand you workloads that barely run...  ...matter. Responsibilities Serve Models at Scale: Design and operate... 
    Flexible hours

    Adaption

    San Francisco, CA
    10 days ago
  • $350k

     ...growing group of committed researchers, engineers, policy experts, and business leaders...  .... About the Role Anthropic's inference fleet serves Claude to millions of users across our...  ...tightly coupled: accelerator kernels, model servers, distributed routing, autoscaling... 
    Work at office
    Visa sponsorship
    Flexible hours
    San Francisco, CA
    18 days ago
  •  ...About the Team Our Inference team brings OpenAI's most capable research and technology...  ...use and access our state-of-the-art AI models, allowing them to do things that they've...  ...About the Role We are looking for an engineer who wants to take the world's largest... 

    OpenAI

    San Francisco, CA
    4 days ago
  • $192k - $260k

     ...improve their business. Foundation Model Serving is the API Product for hosting and serving frontier AI model inference for open source models like Llama, Qwen,...  ...experience is necessary. We're looking for engineers who have owned high scale operational sensitive... 
    Local area
    Worldwide

    Databricks

    San Francisco, CA
    1 day ago
  • $192k - $260k

     ...improve their business. Databricks' Model Serving product provides enterprises with a...  .... It offers real-time, low-latency inference, governance, monitoring, and lineage. As...  ...SLAs and cost efficiency. As a Staff Engineer, you'll play a critical role in shaping... 
    Local area
    Worldwide

    Databricks

    San Francisco, CA
    21 hours ago
