LLM Inference Engineer

Dormont Manufacturing Co

About Us

Hippocratic AI is the leading generative AI company in healthcare. We have the only system that can have safe, autonomous, clinical conversations with patients. We have trained our own LLMs as part of our Polaris constellation, resulting in a system with over 99.9% accuracy.

About the Role

We're seeking an experienced LLM Inference Engineer to optimize our large language model (LLM) serving infrastructure. The ideal candidate has:

- Extensive hands-on experience with state-of-the-art inference optimization techniques
- A track record of deploying efficient, scalable LLM systems in production environments

What You'll Do

- Design and implement multi-node serving architectures for distributed LLM inference
- Optimize multi-LoRA serving systems
- Apply advanced quantization techniques (FP4/FP6) to reduce model footprint while preserving quality
- Implement speculative decoding and other latency optimization strategies
- Develop disaggregated serving solutions with optimized caching strategies for prefill and decoding phases
- Continuously benchmark and improve system performance across various deployment scenarios and GPU types

What You Bring

Must-Have:

- Experience optimizing LLM inference systems at scale
- Proven expertise with distributed serving architectures for large language models
- Hands-on experience implementing quantization techniques for transformer models
- Strong understanding of modern inference optimization methods, including:
  - Speculative decoding techniques with draft models
  - Eagle speculative decoding approaches
- Proficiency in Python and C++
- Experience with CUDA programming and GPU optimization

Nice-to-Have:

- Contributions to open-source inference frameworks such as vLLM, SGLang, or TensorRT-LLM
- Experience with custom CUDA kernels
- Track record of deploying inference systems in production environments
- Deep understanding of performance optimization systems

Show us what you've built: Tell us about an LLM inference or training project that makes you proud! Whether you've optimized inference pipelines to achieve breakthrough performance, designed innovative training techniques, or built systems that scale to billions of parameters, we want to hear your story.

Open source contributor? Even better! If you've contributed to projects like vLLM, SGLang, LMDeploy, or similar LLM optimization frameworks, we'd love to see your PRs. Your contributions to these communities demonstrate exactly the kind of collaborative innovation we value.

Join a team where your expertise won't just be appreciated; it will be celebrated and amplified. Help us shape the future of AI deployment at scale!

Vacancy posted 1 day ago
