LLM Inference Engineer
Dormont Manufacturing Co
About Us

Hippocratic AI is the leading generative AI company in healthcare. We have the only system that can have safe, autonomous, clinical conversations with patients. We have trained our own LLMs as part of our Polaris constellation, resulting in a system with over 99.9% accuracy.

About the Role

We're seeking an experienced LLM Inference Engineer to optimize our large language model (LLM) serving infrastructure. The ideal candidate has:
- Extensive hands‑on experience with state‑of‑the‑art inference optimization techniques
- A track record of deploying efficient, scalable LLM systems in production environments

What You'll Do
- Design and implement multi-node serving architectures for distributed LLM inference
- Optimize multi-LoRA serving systems
- Apply advanced quantization techniques (FP4/FP6) to reduce model footprint while preserving quality
- Implement speculative decoding and other latency optimization strategies
- Develop disaggregated serving solutions with optimized caching strategies for prefill and decoding phases
- Continuously benchmark and improve system performance across various deployment scenarios and GPU types

What You Bring

Must-Have:
- Experience optimizing LLM inference systems at scale
- Proven expertise with distributed serving architectures for large language models
- Hands‑on experience implementing quantization techniques for transformer models
- Strong understanding of modern inference optimization methods, including:
  - Speculative decoding techniques with draft models
  - Eagle speculative decoding approaches
- Proficiency in Python and C++
- Experience with CUDA programming and GPU optimization

Nice‑to‑Have:
- Contributions to open‑source inference frameworks such as vLLM, SGLang, or TensorRT-LLM
- Experience with custom CUDA kernels
- Track record of deploying inference systems in production environments
- Deep understanding of performance optimization systems
- Show us what you've built: Tell us about an LLM inference or training project that makes you proud! Whether you've optimized inference pipelines to achieve breakthrough performance, designed innovative training techniques, or built systems that scale to billions of parameters – we want to hear your story.
- Open source contributor? Even better! If you've contributed to projects like vLLM, SGLang, LMDeploy, or similar LLM optimization frameworks, we'd love to see your PRs. Your contributions to these communities demonstrate exactly the kind of collaborative innovation we value.
- Hippocratic AI is looking for an experienced LLM Inference Engineer based in Palo Alto, CA, to optimize their large language model (LLM) serving infrastructure. You'll design multi-node serving architectures and implement advanced optimization techniques while collaborating...
- ScOp Venture Capital is looking for an ML Systems Engineer to optimize LLM inference systems crucial for their AI platform. The role focuses on enhancing performance and efficiency via low-level systems optimization, directly impacting industry leader processes in semiconductor...
- NVIDIA Gruppe is looking for a skilled engineer to join their TensorRT Edge-LLM team in Santa Clara, California. The role involves developing a state-of-the-art inference framework for large language models and optimizing it for real-time performance on embedded platforms...
- MatX is seeking a skilled software engineer to build custom silicon for AI language models in Mountain View, California. The role requires... ...building libraries for memory management and developing an LLM inference serving stack. Benefits include generous salary, equity... (Remote job)
- Modular Mailing Systems, Inc. is seeking an experienced Performance Engineer to optimize LLM inference on their cloud platform. This pivotal role involves building optimization infrastructures and collaborating with teams to enhance performance across GPUs and ASICs. The... (Remote job, Flexible hours)
$198k - $286k
- ...About the role At Modular, we optimize inference from kernel to cloud on one unified stack... ...optimizations across kernels, the inference engine, and distributed systems so that customer... ...direction of Modular Cloud, delivering LLM performance on the Pareto frontier for agentic... (Remote job, Work experience placement, Work at office, Local area, Flexible hours)
$124.8k - $235k
- ...businesses. What the Role Entails End-to-End Inference Optimization: Lead the optimization of... ...inference pipeline for Large Models (LLM, Multimodal); focus on KV Cache storage... ...Ph.D. in Computer Science, Electronic Engineering, AI, or related fields; significant professional... (Full time, Relocation package)
- ...Frontier Group sits at the leading edge of what’s possible with LLM inference on heterogeneous hardware. Our charter spans the full stack:... ...unique computational fabric. We are an applied research and engineering team that moves fast, ships real systems, and works directly... (Full time)
- ...infrastructure company in California is seeking a Member of Technical Staff — Inference to design and optimize large-scale AI inference systems. The role demands 5+ years in systems engineering and expertise in large-scale inference systems. Successful candidates will... (Flexible hours)
$192.6k - $305.6k
- ...world impact converge at scale. We’re hiring a Staff Backend Engineer to build and operate the infrastructure those models depend on... ...with a focus on the performance, reliability, and scalability of inference systems. Join us and help influence how billions of gaming... (Work at office, Worldwide, Relocation package)
$184k - $287.5k
- ...Overview We are seeking highly skilled and motivated software engineers to join us and build AI inference systems that serve large-scale models with extreme... ...Desired Experience Experience building and optimizing LLM inference engines (e.g., vLLM, SGLang). Hands‑on work...
$135k
- United States Digital Space LLC in Palo Alto seeks an Application Software Engineer to develop high-performance AI inference systems. This role emphasizes the design and optimization of large-scale systems used for mission-critical applications. The ideal candidate possesses... (Remote work)
- ...financial institution is seeking a Senior Principal Software Engineer to provide engineering expertise within the Commercial & Investment... ...solutions, implementing MLOps practices, and optimizing model inference for high-performance applications. The ideal candidate will...
- ...workflow automation with Moveworks’ Reasoning Engine and natural language capabilities, we... ...scaling the end-to-end systems for the entire ML/LLM lifecycle. This includes our infrastructure for distributed training and inference, model evaluation frameworks, and LLM latency... (Work at office, Remote work, Flexible hours)
- Poetiq is seeking an AI Engineer in Los Altos, California. In this role, you will build a robust platform for AI intelligence, relying on foundation models and complex LLM interactions. Your skills in designing high-performance infrastructure and working autonomously on...
$190k - $250k
- ...Location Type Hybrid Department AI We are looking for an AI Inference engineer to join our growing team. Our current stack is Python, Rust,... ...respond to system outages Explore novel research and implement LLM inference optimizations Qualifications Experience with ML... (Full time)
- A leading technology company in California is seeking an Engineering Manager to lead the development of cutting-edge LLM/VLM technologies. In this hands-on leadership role, you will manage a team responsible for optimizing runtime and frameworks, while collaborating with...
- ...we are building a mathematical reasoning engine that operates with absolute precision. While... ...to our distributed training loops and inference engines. We are seeking engineers who view... ...reinforcement learning and low-latency LLM production traffic. Tune the inference engine...
- ...to deliver industry-leading training and inference speeds; over 10 times faster than GPU-based... ...Role We are hiring a Senior Performance Engineer to join our Product team. You are an... ...source inference stacks (vLLM, SGLang, TensorRT-LLM), GPU kernel-level optimization... (Contract work, Shift work)
- Sarah Smith Fund in Palo Alto is seeking an Inference Engineer to enhance the performance of AI-driven products. In this technical role, you will design and optimize inference pipelines, implementing cutting-edge acceleration techniques to improve model speed and efficiency...
$152k - $241.5k
- We are looking for a Senior DL Algorithms Engineer for LLM/Omni model optimizations! Seeking senior engineers who are mindful of performance... ...models (like Nemotron and Cosmos) on NVIDIA’s accelerated inference SW stack. Contribute new features, fix bugs and deliver...
$152k - $241.5k
- ...NVIDIA is seeking top‑tier AI Compiler Engineers to drive innovation within our world‑class... ...problems for AI workloads (both inference and training) and successfully transition... .../or custom AI accelerator architectures. LLM Knowledge: Deep understanding of Large Language...
- Cerebras is seeking a Staff Engineer to join their Inference Platform team in Sunnyvale, California. This role involves leading and contributing to projects focused on cloud and ML components, with responsibilities spanning design, mentoring, and coding. The ideal candidate...
$170.5k - $315.49k
- ...models fast on the hardware people actually own. You optimize inference engines (llama.cpp, vLLM) for constrained local and edge environments... ...Python; comfortable reading systems‑level code Understands how LLM inference works (attention, KV cache, decoding) Has profiled... (Local area, Immediate start, Shift work)
- NVIDIA is seeking a Senior DL Algorithms Engineer to optimize LLM/Omni models and enhance performance across its software stack. The ideal candidate... ...3+ years of experience in deep learning, specifically in inference. This role involves profiling, analyzing bottlenecks, and...
- Pika is looking for a Senior Inference Engineer in Palo Alto to enhance the performance of AI-driven products. This pivotal role involves designing and optimizing inference pipelines, applying advanced techniques to improve model speed and efficiency. The ideal candidate...
- Pika is seeking an experienced Inference Engineer to enhance the performance of its AI-driven products in Palo Alto, CA. The role encompasses optimizing inference pipelines, applying advanced acceleration techniques, and collaborating on AI model deployment. The ideal...
$128.7k - $261.3k
- About the Team The Model Deployment & Inference Solutions team in GM AV deploys machine learning... ...currently performed manually by engineers. Build the developer experience that ML model... ...Qualifications Experience building agentic or LLM‑powered developer tooling. Experience... (Local area, Remote work, Flexible hours, Shift work)
$180k - $250k
- A leading AI infrastructure firm is seeking a TPU Systems Engineer to develop high-performance systems using JAX, XLA, and Pallas. This role involves pushing large-model workloads on TPU hardware and optimizing performance across the stack. Candidates should have at least...
- ...companies. As a Senior Principal Software Engineer at JPMorganChase within the Commercial &... ...deployment and optimization using Model Inference servers such as Triton Inference Server and... ...success architecting and deploying LLM & GNN solutions on AWS (e.g., SageMaker,...

