LLM Inference Engineer: Scalable, Low-Latency Serving

Hippocratic AI

Hippocratic AI is seeking an experienced LLM Inference Engineer in Palo Alto to optimize large language model serving infrastructure. The role requires expertise in optimizing LLM inference systems at scale and deploying efficient, scalable LLM systems in production. The ideal candidate will design multi-node architectures, apply quantization techniques, and benchmark system performance. With a focus on safety in healthcare, this role offers the chance to work with a team of healthcare and AI leaders.

Vacancy posted 6 hours ago
Similar jobs that could be interesting for you
Based on the LLM Inference Engineer: Scalable, Low-Latency Serving in Palo Alto, CA vacancy
  • $126 per hour

     ...healthcare‑only, safety‑focused LLM—a breakthrough platform...  ...an experienced LLM Inference Engineer to optimize our large language model serving infrastructure. The...  ...deploying efficient, scalable LLM systems in production...  ...decoding and other latency‑optimization strategies... 
    Suggested
    Work at office

    Hippocratic AI

    Palo Alto, CA
    6 hours ago
  •  ...senior member of the LLM inference framework team, you will...  ...driving performance, scalability, and reliability, enabling...  ...platform for LLM serving. This role sits at the...  ...intersection of inference engines, distributed systems,...  ...terms of throughput, latency, memory movement, and... 
    Suggested

    Advanced Micro Devices

    Santa Clara, CA
    3 days ago
  • $135.8k - $237.05k

     ...View, CA, USA Senior Backend Engineer, ML Inference Systems Location Mountain View...  ...performance, reliability, and scalability of inference systems. Join us...  ...inference platform, with a focus on low-latency, high-throughput serving infrastructure Partner with ML... 
    Suggested
    Work at office
    Worldwide
    Relocation package

    Unity Technologies

    Mountain View, CA
    1 day ago
  •  ...Moveworks’ Reasoning Engine and natural...  ...the long-term scalability of our core AI...  ...the entire ML/LLM lifecycle. This...  ...distributed training and inference, model...  ...frameworks, and LLM latency optimization....  ...team builds serve as the foundation...  ...to ultra-low-latency inference... 
    Suggested
    Work at office
    Remote work
    Flexible hours

    ServiceNow

    Mountain View, CA
    7 days ago
  • $207k - $300k

    Senior Research Engineer, On-Device Inference, Robotics, DeepMind. DeepMind, Mountain...  ...Gemini Robotics models for deployment in low-latency on-device applications. You will...  ...that is representative of the users we serve, creating a culture of belonging, and... 
    Suggested
    Full time

    Google Inc.

    Mountain View, CA
    2 days ago
  •  ...Department: Backend Engineer · Work type: On-Site...  ...real-time multimodal LLM for real life,...  ...building performant, scalable, and resilient distributed...  ..., optimize for latency and throughput, and...  ...support high-throughput, low-latency AI model inference and data services.... 
    Full time

    Neara

    Palo Alto, CA
    4 days ago
  •  ...industry-leading training and inference speeds and empowers...  ...a Senior Performance Engineer to join our Product...  ...inference performance and will serve as our resident expert...  ..., time to first token, latency under concurrency, and...  ...vLLM, SGLang, TensorRT-LLM), GPU kernel-level... 
    Contract work
    Shift work

    CEREBRAS SYSTEMS INC.

    Sunnyvale, CA
    1 day ago
  • MatX is seeking a skilled software engineer to build custom silicon for AI language models in Mountain View, California. The role...  ...building libraries for memory management and developing an LLM inference serving stack. Benefits include generous salary, equity offerings,... 
    Remote job

    MatX

    Mountain View, CA
    4 days ago
  • $184k - $287.5k

    NVIDIA is recruiting a Senior Inference Engineer to advance AIConfigurator ( a...  ...configurations for large-scale LLM inference. This role integrates GPU systems, model serving, performance modeling, and production...  ...by optimizing efficiency, latency, parallelism, and resource... 
    Full time

    NVIDIA

    Santa Clara, CA
    1 day ago
  •  ...Moveworks' Reasoning Engine and natural language capabilities...  ...for building and serving LLMs at Moveworks. This...  ...training and inference pipeline for large language...  ...monitoring framework, LLM latency optimization, etc....  ...solving many challenges on scalability of services as well as... 
    Work at office
    Remote work
    Flexible hours

    ServiceNow

    Mountain View, CA
    2 days ago
  •  ...for a Machine Learning Engineer to help build cutting...  ...for building and serving LLMs at Moveworks. This...  ...distributed training and inference pipeline for large...  ...monitoring framework, LLM latency optimization, etc....  ...solving many challenges on scalability of services as well... 

    Moveworks

    Mountain View, CA
    16 hours ago
  • Garuda Ventures in Palo Alto is seeking Robotics Software Engineers to build high-performance middleware and runtime systems for robotic platforms. You will design low-latency execution frameworks and optimize inter-process communication. Successful candidates will develop... 

    Garuda Ventures

    Palo Alto, CA
    3 days ago
  • $120k - $250k

     ...Runtime Engineer Mountain View, CA What MatX Is Building...  ...silicon for large-language-model inference and training, with HW/SW co-...  ...consumers Build the LLM inference serving stack — paged KV cache, continuous...  ..., paged, arena) or other low-level memory work ML framework... 
    Full time
    Contract work
    Work experience placement
    Local area
    Remote work
    Monday to Friday
    Flexible hours

    MatX

    Mountain View, CA
    2 days ago
  •  ...that fuels it, recursively accelerating the path to artificial superintelligence. We are interested in best-in-class engineers to focus on a variety of challenges relating to scaling, low-level optimization, and core infrastructure for LLM training and inference.... 

    Ricursive Intelligence

    Palo Alto, CA
    16 hours ago
  •  ...Performance Engineer RadixArk is hiring a Performance Engineer...  ..., CA — someone who can push LLM inference and training systems to the...  ...RadixArk infrastructure stack: latency, throughput, GPU utilization...  ...cost-per-token and serving efficiency Partner with customers... 
    Flexible hours

    RadixArk

    Palo Alto, CA
    16 hours ago
  • NVIDIA Gruppe is looking for a skilled engineer to join their TensorRT Edge-LLM team in Santa Clara, California. The role involves developing a state-of-the-art inference framework for large language models and optimizing it for real-time performance on embedded platforms... 

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  •  ...hiring a Forward Deployed Engineer to embed directly with...  ..., sub‑100ms latency budgets, regulated data...  ...targets. Tune streaming and inference paths to hit sub‑100ms...  ...10 customers can self‑serve from. What You Bring 4+...  ...LiveKit, Pipecat, SIP, or low‑latency audio/video. Experience... 
    Contract work
    Work at office
    Remote work
    Visa sponsorship

    39 Ai, Inc.

    Mountain View, CA
    3 days ago
  • $184k - $356.5k

     ...leading AI computing company in California is seeking a Senior Deep Learning Software Engineer focused on performance optimization of LLM models. You will analyze and enhance LLM inference performance, working in cross-collaborative teams to implement cutting-edge... 
    Full time

    NVIDIA Corporation

    Santa Clara, CA
    2 days ago
  • $240k - $400k

     ...artificial intelligence. Role Summary As our Founding Engineer, you will own a zero‑to‑one product and its infrastructure...  .... Familiarity with LangGraph is a plus. Stand up inference paths with low latency serving and token‑level observability Productionize prompt pipelines... 
    Visa sponsorship

    Pear VC

    Palo Alto, CA
    4 days ago
  • $184k - $287.5k

    Senior DL Algorithms Engineer - Inference Performance...  ..., fix bugs and deliver production code to TRT-LLM, NVIDIA’s open-source inference serving library. Profile and analyze bottlenecks across... 

    NVIDIA Corporation

    Santa Clara, CA
    2 days ago
  • $120.1k - $225.7k

     ...End-to-End Inference Optimization: Lead...  ...for Large Models (LLM, Multimodal); focus...  ...throughput and minimize latency. Heterogeneous...  ...Science, Electronic Engineering, AI, or related...  ...understanding of low-level programming...  ...allow us to better serve our users and the... 
    Relocation package

    Tencent

    Palo Alto, CA
    2 days ago
  • $147.4k - $272.1k

    Apple Inc. is seeking a Software Development Engineer for Siri Runtime Systems and Interaction in Cupertino, California. This role involves...  ...and integrating next-generation Siri experiences, focusing on low-latency interactions and system performance optimization. Ideal... 

    Apple Inc.

    Cupertino, CA
    3 days ago
  • $180k

    A cutting-edge AI firm in California is seeking engineers focused on optimizing AI model inference. Candidates should have experience with Python, Rust, and...  ...optimizations. The role involves building reliable serving systems and contributing to innovative AI technologies... 

    Pantera Capital

    Palo Alto, CA
    1 day ago
  • $300k - $400k

     ...frontier model training and inference fast, efficient, and...  ...Kubernetes, minimizing latency and maximizing...  ...the RL loop tight and low-latency Build fast...  ...shifting, scheduling, and serving architecture at production...  ...best — the scientists, engineers, and problem-solvers who... 
    Visa sponsorship
    Flexible hours
    Shift work

    Periodic Labs

    Menlo Park, CA
    4 days ago
  •  ...your impact. We look for low-ego individuals who...  ...how work gets done. GTM Engineer - ABM, Advertising and...  ...platforms to build the scalable technical backbone powering...  ...touchpoints at scale. Serve as the primary...  ...emerging AI tools (e.g., new LLM capabilities, AI-native... 

    Snowflake

    Menlo Park, CA
    8 hours ago
  • $174k - $252k

     ...from research concepts to scalable production systems. Experience...  ...ML projects, including LLM training, inference, and fine-tuning. Experience...  ...are looking for a research engineer for the Frontier Safety...  ...representative of the users we serve, creating a culture of... 
    Full time

    Google Inc.

    Mountain View, CA
    1 day ago
  •  ...(App Store, Global Search, Game Center). You’ll build scalable, reliable systems that serve millions of daily users.   Key Responsibilities:...  ...services for recommendation and search with a focus on low latency and high availability. Implement multi-level caching... 
    For contractors

    OPPO US Research Center

    Palo Alto, CA
    2 days ago
  • Moveworks is seeking a Machine Learning Engineer in Mountain View, California, to design and optimize scalable ML infrastructure for large language models. This pivotal role requires collaboration with cross-functional teams to enhance AI product scalability. The ideal... 

    Moveworks

    Mountain View, CA
    16 hours ago
  •  ...We are seeking a Production Engineer to take complete, end-to-end...  ...on AWS to ensure exceptional scalability and fault tolerance. Lead...  ...unique challenges of real-time, low-latency systems, particularly...  ...designing and operating robust self-serve developer platforms,... 

    Sanas

    Palo Alto, CA
    16 hours ago
  • $184k - $287.5k

     ...are seeking highly skilled and motivated software engineers to join us and build AI inference systems that serve large-scale models with extreme efficiency. You’ll...  ...Desired Experience Experience building and optimizing LLM inference engines (e.g., vLLM, SGLang). Hands‑on... 

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
