AI Inference Engineer

Triune Infomatics Inc

Role: AI Inference Engineer
Location: San Jose, CA
Duration: 6 to 12 Months

Overview: We are seeking a highly skilled AI Inference Engineer to join our team and drive the performance, scalability, and reliability of our large-scale model serving infrastructure. This role sits at the intersection of systems engineering, GPU optimization, and distributed infrastructure, and is ideal for someone who thrives on squeezing maximum performance out of production AI workloads.
The ideal candidate has hands-on experience building or operating production-grade inference serving systems and is comfortable working close to the hardware, from CUDA/ROCm kernels to distributed multi-node, multi-GPU clusters serving large language models at scale.

Key Responsibilities:
Inference Serving & Optimization

Build, operate, and optimize production model-serving stacks using frameworks such as vLLM, SGLang, Triton Inference Server, TensorRT-LLM, TorchServe, or KServe
Develop and maintain custom high-throughput microservices for model inference using C++, Python, and Rust

GPU & Hardware Acceleration

Write and optimize custom GPU kernels using CUDA, ROCm (HIP), or the Triton kernel language
Apply deep understanding of GPU architecture, including memory hierarchies and tensor cores, to improve compute efficiency

LLM Inference Internals

Optimize prefill and decode stages, attention mechanisms, and continuous batching
Implement and tune quantization, speculative decoding, tensor parallelism, pipeline parallelism, and Mixture of Experts (MoE) serving strategies

Memory & KV Cache Management

Design and implement KV cache optimization strategies, including PagedAttention, chunked prefill, prefix caching, and KV cache quantization
Develop cache transfer and offload strategies to manage memory pressure under high-volume, irregular workloads

Distributed Systems & Infrastructure

Build and operate fault-tolerant, high-concurrency serving systems deployed on Kubernetes, OpenShift, or similar orchestration platforms, packaged with Helm or comparable tooling
Implement tensor parallelism, pipeline parallelism, and distributed computing across multi-node, multi-GPU clusters

Distributed Serving Platform (Dynamo)

Contribute to distributed serving architecture components including frontend, router, worker discovery, multi-model routing, and health checks
Build and maintain OpenAI-compatible endpoints across multiple backends, including SGLang, TensorRT-LLM, and vLLM

Performance & Reliability

Conduct deep profiling and benchmarking to identify and resolve latency and throughput regressions
Build telemetry-driven observability platforms ensuring high availability, load balancing, and dynamic request scheduling

Model Support

Bring up and support a broad range of model classes in production, including decoder-only LLMs, MoE models, hybrid attention/SSM models, multimodal models, embedding models, reward models, and classification models

Required Qualifications:

Proven experience with production model-serving frameworks (vLLM, SGLang, Triton Inference Server, TensorRT-LLM, TorchServe, KServe, or custom runtimes)
Strong proficiency in C++, Python, and Rust for building high-performance, memory-efficient systems
Hands-on experience writing GPU kernels using CUDA and/or ROCm
Solid understanding of LLM inference internals, including attention mechanisms, KV cache management, continuous batching, and quantization
Experience with distributed, multi-node, multi-GPU serving environments
Experience deploying and managing services on Kubernetes, OpenShift, or similar orchestration platforms
Strong background in performance profiling, benchmarking, and debugging latency or throughput issues

Preferred Qualifications:

Direct experience working with NVIDIA Dynamo or similar distributed serving architectures (router, worker discovery, multi-model routing)
Experience supporting diverse model types in production, including MoE, multimodal, and hybrid attention/SSM architectures
Familiarity with OpenAI-compatible API design and implementation
Experience with telemetry and observability tooling for large-scale GPU infrastructure


Vacancy posted 12 hours ago
