LLM Inference Frameworks and Optimization Engineer

Gravity Engineering Services Pvt Ltd.

About the Role

At Together.ai, we are building state-of-the-art infrastructure to enable efficient and scalable inference for large language models (LLMs). Our mission is to optimize inference frameworks, algorithms, and infrastructure, pushing the boundaries of performance, scalability, and cost-efficiency.

We are seeking an Inference Frameworks and Optimization Engineer to design, develop, and optimize distributed inference engines that support multimodal and language models at scale. This role will focus on low-latency, high-throughput inference, GPU/accelerator optimizations, and software-hardware co-design, ensuring efficient large-scale deployment of LLMs and vision models.

This role offers a unique opportunity to shape the future of LLM inference infrastructure, ensuring scalable, high-performance AI deployment across a diverse range of applications. If you're passionate about pushing the boundaries of AI inference, we'd love to hear from you!

Responsibilities

Inference Framework Development and Optimization
  • Design and develop a fault-tolerant, high-concurrency distributed inference engine for text, image, and multimodal generation models.
  • Implement and optimize distributed inference strategies, including Mixture of Experts (MoE) parallelism, tensor parallelism, and pipeline parallelism, for high-performance serving.
  • Apply CUDA graph optimizations, TensorRT/TRT-LLM graph optimizations, PyTorch-based compilation (torch.compile), and speculative decoding to enhance efficiency and scalability.

Software-Hardware Co-Design and AI Infrastructure
  • Collaborate with hardware teams on performance bottleneck analysis and co-optimize inference performance for GPUs, TPUs, or custom accelerators.
  • Work closely with AI researchers and infrastructure engineers to develop efficient model execution plans and optimize end-to-end (E2E) model serving pipelines.

Requirements

Must-Have:
  • Experience: 3+ years of experience in deep learning inference frameworks, distributed systems, or high-performance computing.
  • Technical Skills: Familiarity with at least one LLM inference framework (e.g., TensorRT-LLM, vLLM, SGLang, TGI (Text Generation Inference)). Background knowledge and experience in at least one of the following: GPU programming (CUDA/Triton/TensorRT), compilers, model quantization, and GPU cluster scheduling. Deep understanding of KV cache systems like Mooncake, PagedAttention, or custom in-house variants.
  • Programming: Proficient in Python and C++/CUDA for high-performance deep learning inference.
  • Optimization Techniques: Deep understanding of Transformer architectures and LLM/VLM/Diffusion model optimization. Knowledge of inference optimization techniques such as workload scheduling, CUDA graphs, compilation, and efficient kernels.
  • Soft Skills: Strong analytical problem-solving skills with a performance-driven mindset. Excellent collaboration and communication skills across teams.

Nice-to-Have:
  • Experience in developing software systems for large-scale data center networks with RDMA/RoCE.
  • Familiarity with distributed filesystems (e.g., 3FS, HDFS, Ceph).
  • Familiarity with open-source distributed scheduling/orchestration frameworks, such as Kubernetes (K8S).
  • Contributions to open-source deep learning inference projects.

Vacancy posted 3 days ago
Location: San Francisco, CA