Inference Engineer

techire ai

Job Description

Machine Learning Engineer, Inference

Want to solve realtime inference problems where milliseconds genuinely matter?

This role is with a fast-growing voice AI company building the realtime speech infrastructure layer behind hundreds of millions of production conversations every month. Their systems power enterprise voice experiences used at massive scale across customer support, ordering, and conversational automation.

This is not another generic AI platform role focused on wrapping APIs or building dashboards.

The work here sits deep in the runtime stack, optimising realtime speech systems under production latency constraints. Think streaming inference, scheduler design, GPU utilisation, concurrency optimisation, dynamic batching, and making state-of-the-art speech models actually behave correctly in realtime environments.

You'll join a lean engineering team working directly on the inference systems behind low-latency conversational speech models. The challenge is not simply generating outputs; it's generating speech naturally, reliably, and fast enough for real human interaction.

Your work will include:

Building and optimising realtime TTS streaming infrastructure
Improving scheduler and batching systems for production workloads
Reducing time to first audio (TTFA) and time to first byte (TTFB) while maintaining speech quality and stability
GPU profiling and identifying kernel-level bottlenecks
Optimising TensorRT, Triton, ONNX Runtime, and custom serving systems
Managing KV cache systems, speculative decoding, and streaming inference
Supporting heterogeneous deployment environments across NVIDIA and AMD GPUs
Collaborating closely with model researchers to productionise cutting-edge speech systems

A large part of the role involves solving difficult runtime problems where latency consistency, concurrency, and throughput directly impact user experience. The team's systems already outperform most publicly available realtime speech systems, but there's still substantial room to push the infrastructure further.

You'll likely have strong depth across inference systems, runtime optimisation, distributed serving, or GPU performance engineering. Experience with tools like TensorRT, Triton, vLLM, CUDA Graphs, ONNX Runtime, or custom schedulers would be highly valuable.

The environment suits engineers who naturally investigate bottlenecks, enjoy working close to hardware constraints, and care deeply about performance engineering. If reducing latency by 30ms feels meaningful, you'll probably enjoy this team.

The stack includes Rust, C++, Python, CUDA, TensorRT, Triton, Kubernetes, AWS, and custom realtime inference infrastructure.

Compensation is highly competitive and flexible depending on experience, including strong salary, equity, and benefits.

Location: Remote across the US or Europe.

If you're excited by realtime AI systems problems where optimisation work directly shapes production performance at scale, this would be worth exploring.

All applicants will receive a response.

Vacancy posted 5 days ago

