LLM Inference Frameworks and Optimization Engineer

Gravity Engineering Services Pvt Ltd.

About the Role

At Together.ai, we are building state-of-the-art infrastructure to enable efficient and scalable inference for large language models (LLMs). Our mission is to optimize inference frameworks, algorithms, and infrastructure, pushing the boundaries of performance, scalability, and cost-efficiency.

We are seeking an Inference Frameworks and Optimization Engineer to design, develop, and optimize distributed inference engines that support multimodal and language models at scale. This role will focus on low-latency, high-throughput inference, GPU/accelerator optimizations, and software-hardware co-design, ensuring efficient large-scale deployment of LLMs and vision models.

This role offers a unique opportunity to shape the future of LLM inference infrastructure, ensuring scalable, high-performance AI deployment across a diverse range of applications. If you're passionate about pushing the boundaries of AI inference, we’d love to hear from you!

Responsibilities

Inference Framework Development and Optimization
- Design and develop a fault-tolerant, high-concurrency distributed inference engine for text, image, and multimodal generation models.
- Implement and optimize distributed inference strategies, including Mixture of Experts (MoE) parallelism, tensor parallelism, and pipeline parallelism, for high-performance serving.
- Apply CUDA graph optimizations, TensorRT/TRT-LLM graph optimizations, PyTorch-based compilation (torch.compile), and speculative decoding to enhance efficiency and scalability.

Software-Hardware Co-Design and AI Infrastructure
- Collaborate with hardware teams on performance bottleneck analysis and co-optimize inference performance for GPUs, TPUs, or custom accelerators.
- Work closely with AI researchers and infrastructure engineers to develop efficient model execution plans and optimize end-to-end (E2E) model serving pipelines.

Requirements

Must-Have:
- Experience: 3+ years of experience in deep learning inference frameworks, distributed systems, or high-performance computing.
- Technical Skills:
  - Familiarity with at least one LLM inference framework (e.g., TensorRT-LLM, vLLM, SGLang, TGI (Text Generation Inference)).
  - Background knowledge and experience in at least one of the following: GPU programming (CUDA/Triton/TensorRT), compilers, model quantization, or GPU cluster scheduling.
  - Deep understanding of KV cache systems such as Mooncake, PagedAttention, or custom in-house variants.
- Programming: Proficient in Python and C++/CUDA for high-performance deep learning inference.
- Optimization Techniques:
  - Deep understanding of Transformer architectures and LLM/VLM/Diffusion model optimization.
  - Knowledge of inference optimization techniques such as workload scheduling, CUDA graphs, compilation, and efficient kernels.
- Soft Skills: Strong analytical problem-solving skills with a performance-driven mindset. Excellent collaboration and communication skills across teams.

Nice-to-Have:
- Experience in developing software systems for large-scale data center networks with RDMA/RoCE.
- Familiarity with distributed filesystems (e.g., 3FS, HDFS, Ceph).
- Familiarity with open-source distributed scheduling/orchestration frameworks, such as Kubernetes (K8S).
- Contributions to open-source deep learning inference projects.

Vacancy posted 9 hours ago

