Engineer, Inference & Model serving

$220k - $320k

techire ai

Job Description

ML Model Serving Engineer

Want to build the layer that actually makes AI usable in real time?

You'll join a team focused on inference, where performance is the product. This is about delivering low-latency, high-throughput systems across LLMs, speech, and vision models running in production, not offline experiments.

They're building real-time AI systems that need to respond instantly, reliably, and at scale. That means solving hard problems around batching, GPU efficiency, memory constraints, and system-level bottlenecks that most teams never fully crack.

You'll sit at the core of the platform, working across model serving, infrastructure, and performance optimisation. A big part of the role is pushing current tooling beyond its limits, extending frameworks, profiling bottlenecks, and designing systems that hold up under real-world load.

This is not about training models. It's about making them fast, efficient, and production-ready.

What you'll work on:
  • Building high-performance serving systems for LLM, speech, and vision models
  • Scaling inference to production workloads with strict latency requirements
  • Optimising GPU utilisation and execution efficiency
  • Implementing techniques like continuous batching, KV cache optimisation, speculative decoding, and prefill/decode separation
  • Improving frameworks such as vLLM, TensorRT-LLM, Triton, and SGLang
  • Profiling and debugging performance across GPU, memory, and system layers

What you'll bring:
  • Strong experience with ML inference or model serving systems
  • Deep understanding of latency and throughput optimisation in production
  • Solid Python and PyTorch skills, plus a systems or performance engineering mindset
  • Familiarity with distributed systems and production infrastructure

Exposure to CUDA, GPU profiling tools, or systems like Kubernetes and Ray is useful, but the key is knowing how to make models run efficiently at scale.

You'll join a highly technical team with experience across major AI labs and big tech. The environment is pragmatic, focused on solving real performance problems rather than abstract research.

There's real ownership here. You'll help define how next-generation AI systems are served.

Package:
$220,000 - $320,000 base + equity
San Francisco, onsite 3 days per week

If you're interested in working on the part of AI that actually determines whether it works in the real world, this is worth exploring.

All applicants will receive a response.
Vacancy posted 5 days ago
Similar jobs that could be interesting for you
Based on the Engineer, Inference & Model serving in San Francisco, CA vacancy
  •  ...Model Implementation Engineer Sciforium is an AI infrastructure company developing next-generation multimodal...  ...and a proprietary, high-efficiency serving platform. Backed by multi-million-...  ...with large-scale model training or inference systems. Contributions to open-... 
    Suggested
    Flexible hours

    Sciforium

    San Francisco, CA
    1 day ago
  • $167.2k - $209k

     ...applications. We are seeking a Senior Engineer 2 to join our AI Inference Data Plane team. In this role, you...  ...can deploy and scale their models with industry-leading performance and...  ...Familiarity with distributed inference serving frameworks such as llm‑d, NVIDIA Dynamo... 
    Suggested
    Local area
    Remote work
    Worldwide
    Flexible hours

    DigitalOcean

    San Francisco, CA
    3 days ago
  • $160k - $230k

     ...LLM Inference Frameworks and Optimization Engineer San Francisco, Singapore, Amsterdam About the Role At...  ...scalable inference for large language models (LLMs). Our mission is to optimize...  ...parallelism for high-performance serving. Apply CUDA graph optimizations... 
    Suggested
    Full time

    Together AI

    San Francisco, CA
    6 days ago
  •  ...Job Description Machine Learning Engineer, Inference Want to solve realtime inference problems...  ..., and making state-of-the-art speech models actually behave correctly in realtime...  ...TensorRT, Triton, ONNX Runtime, and custom serving systems Managing KV cache systems,... 
    Suggested
    Remote work
    Flexible hours

    techire ai

    San Francisco, CA
    4 days ago
  • $350k

     ...growing group of committed researchers, engineers, policy experts, and business leaders...  ...systems. About the Role Anthropic's inference fleet serves Claude to millions of users across our...  ...tightly coupled: accelerator kernels, model servers, distributed routing,... 
    Suggested
    Work at office
    Visa sponsorship
    Flexible hours

    Anthropic

    San Francisco, CA
    2 days ago
  • A leading data and AI company in San Francisco is seeking a Senior Engineer to enhance their Model Serving platform. This role requires expertise in building large-scale distributed systems and collaboration across teams to optimize performance and reliability. Ideal candidates... 

    Jobleads-US

    San Francisco, CA
    2 days ago
  •  ...Responsibilities: Turbocharge our serving layer, consisting of a variety of LLM, speech, and vision models. Partner with ML infrastructure and training engineers to build a fast, cost-effective,...  ...and custom kernels to speed up inference. Find ways to reduce model... 
    Full time
    Contract work
    Flexible hours

    SESAME

    San Francisco, CA
    1 day ago
  • $220k - $320k

     ...Help us make inference blazingly fast. If you love squeezing every...  ...and hosts specialized language models for companies that need frontier...  ...-funded ten-person team of engineers who work in-person in...  ...approaches, always with the goal of serving models faster and cheaper at... 
    Work at office

    Inference

    San Francisco, CA
    5 days ago
  •  ...About the Team Our Inference team brings OpenAI's most capable research and technology...  ...use and access our state-of-the-art AI models, allowing them to do things that they've...  ...About the Role We are looking for an engineer who wants to take the world's largest... 

    OpenAI

    San Francisco, CA
    3 days ago
  •  ...compute into useful intelligence - the inference services that serve LLMs at scale and the data pipelines...  ...about both. Researchers and ML engineers will hand you workloads that barely run...  ...matter. Responsibilities Serve Models at Scale: Design and operate... 
    Flexible hours

    Adaption

    San Francisco, CA
    4 days ago
  • $192k - $260k

     ...improve their business. Databricks' Model Serving product provides enterprises with a...  .... It offers real-time, low-latency inference, governance, monitoring, and lineage. As...  ...SLAs and cost efficiency. As a Staff Engineer, you'll play a critical role in shaping... 
    Local area
    Worldwide

    Databricks

    San Francisco, CA
    4 days ago
  • $192k - $260k

     ...improve their business. Foundation Model Serving is the API Product for hosting and serving frontier AI model inference for open source models like Llama, Qwen,...  ...experience is necessary. We're looking for engineers who have owned high scale operational sensitive... 
    Local area
    Worldwide

    Databricks

    San Francisco, CA
    5 days ago
  • $240k - $400k

     ...artificial intelligence. Role Summary As our Founding Engineer, you will own a zero-to-one product and its...  ...generation. Familiarity with LangGraph is a plus. Stand up inference paths with low latency serving and token-level observability Productionize prompt... 
    Visa sponsorship

    pear.ai

    San Francisco, CA
    5 hours ago
  • $175k - $275k

     ...centric design. We are seeking an Agentic Engineer with over 6 years of experience and a...  ...transform how our platform operates and serves customers. Key Responsibilities...  ...databases, embedding systems, and real-time inference. Experience with agent architecture patterns... 

    MGT Insurance

    San Francisco, CA
    3 hours ago
  • $280k

     ...group of committed researchers, engineers, policy experts, and business...  ...OS internals Language modeling with transformers Representative...  ...our models to low-precision inference Write a custom load-balancing algorithm to optimize serving efficiency Build... 
    Work at office
    Visa sponsorship
    Flexible hours

    Anthropic

    San Francisco, CA
    5 days ago
  • $280k

     ...group of committed researchers, engineers, policy experts, and business...  ...possible with large language models. You'll be responsible for...  ...capabilities and dramatically improve inference efficiency. Working at the...  ...bottlenecks in production serving infrastructure Partner... 
    Work at office
    Visa sponsorship
    Flexible hours

    Anthropic

    San Francisco, CA
    3 days ago
  • $300k

     ...technology firm in San Francisco seeks a GPU Optimisation Engineer to maximize GPU performance in real-time AI systems....  ...of GPU execution, and a knack for optimizing inference latency for large generative models. With a competitive base salary of up to ~$300,000 and... 
    Visa sponsorship
    Relocation package

    Trades Workforce Solutions

    San Francisco, CA
    4 days ago
  •  ...Location: Remote Role Description If you’re a senior construction engineering professional who thrives on precision, constructability, field...  ...work. You’ll challenge and evaluate advanced language models on construction engineering topics to strengthen model reasoning... 
    For contractors
    Remote work

    YO IT Consulting

    San Francisco, CA
    5 days ago
  •  ...do. We're pioneering the model architectures that will make...  ...model innovation and systems engineering paired with a design-minded product...  ...Role We're hiring an Inference Engineer to advance our...  ...reliable model inference and serving stack for our cutting edge foundation... 
    Work at office
    Visa sponsorship
    Flexible hours

    Cartesia, Inc.

    San Francisco, CA
    4 days ago
  •  ...GPU Kernel Engineer Sciforium is an AI infrastructure company developing next-generation multimodal AI models and a proprietary, high-efficiency serving platform. Backed by multi-million-dollar funding...  ...large-scale training and inference. This role is ideal for someone... 
    Flexible hours

    Sciforium

    San Francisco, CA
    1 day ago
  •  ...is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers...  ...day is currently Tuesday. Product Engineering at Lambda is responsible for building...  ...systems and supporting AI training and inference at scale. Lambda's Infrastructure Engineering... 
    Work at office
    Local area
    Work from home
    Flexible hours

    Lambda

    San Francisco, CA
    1 day ago
  •  ...Baseten powers mission-critical inference for the world's most dynamic...  ...of AI to bring cutting-edge models into production. We're...  ...and help build the platform engineers turn to to ship AI products....  ...spans distributed systems, model serving, and developer experience. You... 
    Full time
    Flexible hours

    Baseten

    San Francisco, CA
    1 day ago
  •  ...Sciforium's Next-Generation Model Serving Platform Architect Sciforium is an AI infrastructure...  ...AMD with hands-on support from AMD engineers the team is scaling rapidly to build...  ...batching, scheduling, and distributed inference systems. Develop high-performance C++... 
    Work at office
    Flexible hours

    Sciforium

    San Francisco, CA
    1 day ago
  • $216k - $270k

     ...As a Software Engineer on the ML Infrastructure team, you will design...  ..., reliable, and efficient serving of LLMs. Our platform powers...  ...engineers to integrate and optimize models for production and research...  ...-LLM, or text-generation-inference. Compensation packages... 
    Full time

    Scale AI

    San Francisco, CA
    7 hours ago
  • $155k - $245k

     ...from batteries we already have. Project Engineer, Energy Storage Position Summary:...  ...and escalate schedule or cost risks Serve as the primary interface between Business...  ...professional or employment information, and inferences drawn from your PI. We collect your PI... 
    Full time
    Casual work
    Work at office
    Local area

    Redwood Materials

    San Francisco, CA
    4 hours ago
  • $187.5k - $247.5k

     .... Staff Mechanical Design Engineer, EPC Redwood Materials is...  ...Responsibilities will include: Serve as Mechanical Engineer of...  ...Pipe-Flo (or other hydraulic modeling application), Caesar II, AspenTech...  ...employment information, and inferences drawn from your PI. We... 
    Full time
    Work experience placement

    Redwood Materials

    San Francisco, CA
    4 days ago
  •  ...that our platform delivers AI inference. Validating whether inference...  ...looking for a dedicated QA engineer who can own the product's quality...  ...AI inference quality, model deployments, and integrations...  ...~ Working knowledge of LLM serving. ~ Strong experience testing... 
    Worldwide
    Flexible hours

    FriendliAI Corp

    San Francisco, CA
    2 days ago
  •  ...journey to do our best. Helping the customers and businesses we serve to make better and smarter financial decisions and enabling the...  ...statistical Treasury Risk and Pre-provision Net Revenue (PPNR) models. Regularly reviews model monitoring reports. The models may cover... 
    Temporary work
    3 days per week

    U.S. Bancorp

    San Francisco, CA
    5 days ago
  • $170.26k - $200.3k

     ...journey to do our best. Helping the customers and businesses we serve to make better and smarter financial decisions and enabling the...  ...One. Job Description U.S. Bank is seeking an experienced Model Validation Manager to lead validation efforts for our Wholesale... 
    Temporary work
    Local area
    3 days per week

    U.S. Bank

    San Francisco, CA
    2 days ago
  • $405k

     ...growing group of committed researchers, engineers, policy experts, and business leaders working...  ...* Architect eval frameworks that measure model capabilities across diverse coding tasks...  ...them—and drive them to completion * Serve as a senior technical bridge between... 
    Work at office
    Visa sponsorship
    Flexible hours

    Anthropic

    San Francisco, CA
    5 days ago