Inference Engineer

Trades Workforce Solutions

Machine Learning Engineer, Inference

Want to solve realtime inference problems where milliseconds genuinely matter?

This role is with a fast-growing voice AI company building the realtime speech infrastructure layer behind hundreds of millions of production conversations every month. Their systems power enterprise voice experiences used at massive scale across customer support, ordering, and conversational automation.

This is not another generic AI platform role focused on wrapping APIs or building dashboards. The work here sits deep in the runtime stack, optimising realtime speech systems under production latency constraints. Think streaming inference, scheduler design, GPU utilisation, concurrency optimisation, dynamic batching, and making state-of-the-art speech models actually behave correctly in realtime environments.

You’ll join a lean engineering team working directly on the inference systems behind low-latency conversational speech models. The challenge is not simply generating outputs; it’s generating speech naturally, reliably, and fast enough for real human interaction.

Your work will include:
  • Building and optimising realtime TTS streaming infrastructure
  • Improving scheduler and batching systems for production workloads
  • Reducing TTFA/TTFB while maintaining speech quality and stability
  • GPU profiling and identifying kernel-level bottlenecks
  • Optimising TensorRT, Triton, ONNX Runtime, and custom serving systems
  • Managing KV cache systems, speculative decoding, and streaming inference
  • Supporting heterogeneous deployment environments across NVIDIA and AMD GPUs
  • Collaborating closely with model researchers to productionise cutting-edge speech systems

A large part of the role involves solving difficult runtime problems where latency consistency, concurrency, and throughput directly impact user experience. The team already operates beyond the performance of most publicly available realtime speech systems, but there’s still substantial room to push the infrastructure further.

You’ll likely have strong depth across inference systems, runtime optimisation, distributed serving, or GPU performance engineering. Experience with tools like TensorRT, Triton, vLLM, CUDA Graphs, ONNX Runtime, or custom schedulers would be highly valuable.

The environment suits engineers who naturally investigate bottlenecks, enjoy working close to hardware constraints, and care deeply about performance engineering. If reducing latency by 30ms feels meaningful, you’ll probably enjoy this team.

The stack includes Rust, C++, Python, CUDA, TensorRT, Triton, Kubernetes, AWS, and custom realtime inference infrastructure.

Compensation is highly competitive and flexible depending on experience, including strong salary, equity, and benefits.

Location: Remote across the US or Europe.

If you’re excited by realtime AI systems problems where optimisation work directly shapes production performance at scale, this would be worth exploring. All applicants will receive a response.

Vacancy posted 17 hours ago