LLM Inference Engineer: Scalable, Low-Latency Serving
Hippocratic AI
Hippocratic AI is seeking an experienced LLM Inference Engineer in Palo Alto to optimize large language model serving infrastructure. The role requires expertise in optimizing LLM inference systems at scale and deploying efficient, scalable LLM systems in production. The ideal candidate will design multi-node architectures, apply quantization techniques, and benchmark system performance. With a focus on safety in healthcare, this role offers the chance to work with a team of healthcare and AI leaders.
$126 per hour
...healthcare‑only, safety‑focused LLM, a breakthrough platform... ...an experienced LLM Inference Engineer to optimize our large language model serving infrastructure. The... ...deploying efficient, scalable LLM systems in production... ...decoding and other latency‑optimization strategies... Suggested, Work at office
- ...senior member of the LLM inference framework team, you will... ...driving performance, scalability, and reliability, enabling... ...platform for LLM serving. This role sits at the... ...intersection of inference engines, distributed systems,... ...terms of throughput, latency, memory movement, and... Suggested
$135.8k - $237.05k
...View, CA, USA Senior Backend Engineer, ML Inference Systems Location Mountain View... ...performance, reliability, and scalability of inference systems. Join us... ...inference platform, with a focus on low-latency, high-throughput serving infrastructure. Partner with ML... Suggested, Work at office, Worldwide, Relocation package
- ...Moveworks’ Reasoning Engine and natural... ...the long-term scalability of our core AI... ...the entire ML/LLM lifecycle. This... ...distributed training and inference, model... ...frameworks, and LLM latency optimization.... ...team builds serve as the foundation... ...to ultra-low-latency inference... Suggested, Work at office, Remote work, Flexible hours
$207k - $300k
Senior Research Engineer, On-Device Inference, Robotics, DeepMind. DeepMind, Mountain... ...Gemini Robotics models for deployment in low-latency on-device applications. You will... ...that is representative of the users we serve, creating a culture of belonging, and... Suggested, Full time
- ...Department: Backend Engineer · Work type: On-Site... ...real-time multimodal LLM for real life,... ...building performant, scalable, and resilient distributed... ..., optimize for latency and throughput, and... ...support high-throughput, low-latency AI model inference and data services.... Full time
- ...industry-leading training and inference speeds and empowers... ...a Senior Performance Engineer to join our Product... ...inference performance and will serve as our resident expert... ..., time to first token, latency under concurrency, and... ...vLLM, SGLang, TensorRT-LLM), GPU kernel-level... Contract work, Shift work
- MatX is seeking a skilled software engineer to build custom silicon for AI language models in Mountain View, California. The role... ...building libraries for memory management and developing an LLM inference serving stack. Benefits include generous salary, equity offerings,... Remote job
$184k - $287.5k
NVIDIA is recruiting a Senior Inference Engineer to advance AIConfigurator ( a... ...configurations for large-scale LLM inference. This role integrates GPU systems, model serving, performance modeling, and production... ...by optimizing efficiency, latency, parallelism, and resource... Full time
- ...Moveworks' Reasoning Engine and natural language capabilities... ...for building and serving LLMs at Moveworks. This... ...training and inference pipeline for large language... ...monitoring framework, LLM latency optimization, etc.... ...solving many challenges on scalability of services as well as... Work at office, Remote work, Flexible hours
- ...for a Machine Learning Engineer to help build cutting... ...for building and serving LLMs at Moveworks. This... ...distributed training and inference pipeline for large... ...monitoring framework, LLM latency optimization, etc.... ...solving many challenges on scalability of services as well...
- Garuda Ventures in Palo Alto is seeking Robotics Software Engineers to build high-performance middleware and runtime systems for robotic platforms. You will design low-latency execution frameworks and optimize inter-process communication. Successful candidates will develop...
$120k - $250k
...Runtime Engineer Mountain View, CA What MatX Is Building... ...silicon for large-language-model inference and training, with HW/SW co-... ...consumers Build the LLM inference serving stack: paged KV cache, continuous... ..., paged, arena) or other low-level memory work ML framework... Full time, Contract work, Work experience placement, Local area, Remote work, Monday to Friday, Flexible hours
- ...that fuels it, recursively accelerating the path to artificial superintelligence. We are interested in best-in-class engineers to focus on a variety of challenges relating to scaling, low-level optimization, and core infrastructure for LLM training and inference....
- ...Performance Engineer RadixArk is hiring a Performance Engineer... ..., CA, someone who can push LLM inference and training systems to the... ...RadixArk infrastructure stack: latency, throughput, GPU utilization... ...cost-per-token and serving efficiency Partner with customers... Flexible hours
- NVIDIA Gruppe is looking for a skilled engineer to join their TensorRT Edge-LLM team in Santa Clara, California. The role involves developing a state-of-the-art inference framework for large language models and optimizing it for real-time performance on embedded platforms...
- ...hiring a Forward Deployed Engineer to embed directly with... ..., sub‑100ms latency budgets, regulated data... ...targets. Tune streaming and inference paths to hit sub‑100ms... ...10 customers can self‑serve from. What You Bring 4+... ...LiveKit, Pipecat, SIP, or low‑latency audio/video. Experience... Contract work, Work at office, Remote work, Visa sponsorship
$184k - $356.5k
...leading AI computing company in California is seeking a Senior Deep Learning Software Engineer focused on performance optimization of LLM models. You will analyze and enhance LLM inference performance, working in cross-collaborative teams to implement cutting-edge... Full time
$240k - $400k
...artificial intelligence. Role Summary As our Founding Engineer, you will own a zero‑to‑one product and its infrastructure... .... Familiarity with LangGraph is a plus. Stand up inference paths with low latency serving and token‑level observability. Productionize prompt pipelines... Visa sponsorship
$184k - $287.5k
Senior DL Algorithms Engineer - Inference Performance... ..., fix bugs and deliver production code to TRT-LLM, NVIDIA’s open-source inference serving library. Profile and analyze bottlenecks across...
$120.1k - $225.7k
...End-to-End Inference Optimization: Lead... ...for Large Models (LLM, Multimodal); focus... ...throughput and minimize latency. Heterogeneous... ...Science, Electronic Engineering, AI, or related... ...understanding of low-level programming... ...allow us to better serve our users and the... Relocation package
$147.4k - $272.1k
Apple Inc. is seeking a Software Development Engineer for Siri Runtime Systems and Interaction in Cupertino, California. This role involves... ...and integrating next-generation Siri experiences, focusing on low-latency interactions and system performance optimization. Ideal...
$180k
A cutting-edge AI firm in California is seeking engineers focused on optimizing AI model inference. Candidates should have experience with Python, Rust, and... ...optimizations. The role involves building reliable serving systems and contributing to innovative AI technologies...
$300k - $400k
...frontier model training and inference fast, efficient, and... ...Kubernetes, minimizing latency and maximizing... ...the RL loop tight and low-latency Build fast... ...shifting, scheduling, and serving architecture at production... ...best: the scientists, engineers, and problem-solvers who... Visa sponsorship, Flexible hours, Shift work
- ...your impact. We look for low-ego individuals who... ...how work gets done. GTM Engineer - ABM, Advertising and... ...platforms to build the scalable technical backbone powering... ...touchpoints at scale. Serve as the primary... ...emerging AI tools (e.g., new LLM capabilities, AI-native...
$174k - $252k
...from research concepts to scalable production systems. Experience... ...ML projects, including LLM training, inference, and fine-tuning. Experience... ...are looking for a research engineer for the Frontier Safety... ...representative of the users we serve, creating a culture of... Full time
- ...(App Store, Global Search, Game Center). You’ll build scalable, reliable systems that serve millions of daily users. Key Responsibilities:... ...services for recommendation and search with a focus on low latency and high availability. Implement multi-level caching... For contractors
- Moveworks is seeking a Machine Learning Engineer in Mountain View, California, to design and optimize scalable ML infrastructure for large language models. This pivotal role requires collaboration with cross-functional teams to enhance AI product scalability. The ideal...
- ...We are seeking a Production Engineer to take complete, end-to-end... ...on AWS to ensure exceptional scalability and fault tolerance. Lead... ...unique challenges of real-time, low-latency systems, particularly... ...designing and operating robust self-serve developer platforms,...
$184k - $287.5k
...are seeking highly skilled and motivated software engineers to join us and build AI inference systems that serve large-scale models with extreme efficiency. You’ll... ...Desired Experience Experience building and optimizing LLM inference engines (e.g., vLLM, SGLang). Hands‑on...


