ML Engineer - Inference & Model Deployment

Full-time

HiringCafe

Job discovery is broken. Indeed and LinkedIn want to keep it that way. HiringCafe is building a 100x better job search engine: fast, comprehensive, honest, and actually useful. We index millions of jobs, remove noise, rank what matters, and help people find real opportunities without dark patterns, ads, or pay-to-win placement.

We are looking for a founding ML engineer who can help us turn powerful AI and ML models into fast, reliable production systems. You will own the bridge between model development and real user-facing infrastructure: deploying models, optimizing inference latency and throughput, scaling serving systems, and making sure our models run efficiently in production. This is a hands-on engineering role for someone who loves the details of model performance, GPU utilization, inference architecture, and production reliability.

What You’ll Do

  • Deploy and integrate researcher-trained model checkpoints into our cloud infrastructure and production pipelines.
  • Profile and benchmark model performance to identify latency, throughput, memory, and compute bottlenecks.
  • Implement optimization techniques such as quantization, pruning, batching, caching, efficient attention, and precision trade-offs while preserving model quality.
  • Build scalable multi-GPU inference systems for search, ranking, recommendations, agents, and other AI-powered product experiences.
  • Design reliable model-serving architecture that can support millions of users.
  • Develop efficient training and fine-tuning workflows where needed, including distributed training, mixed precision, and parallelism strategies.
  • Work closely with our search & engineering teams to make model deployment a smooth part of our development workflow.

You May Be a Strong Fit If You

  • Have deployed and optimized deep learning models in production environments.
  • Have experience with large-scale model serving, multi-GPU inference, or high-throughput inference systems.
  • Understand inference optimization techniques such as quantization, pruning, compilation, batching, caching, and memory optimization.
  • Have strong instincts for profiling, benchmarking, and debugging model performance.
  • Are familiar with efficient attention mechanisms, transformer optimization, or modern LLM/embedding/ranking model infrastructure.
  • Have worked with inference frameworks or serving stacks such as SGLang, vLLM, TensorRT, or equivalent.
  • Can write clean, production-quality code and integrate ML systems into backend infrastructure.
  • Are comfortable with cloud platforms, distributed systems, storage systems, and modern ML training or serving workflows.
  • Want ownership, leverage, and responsibility from day one.

Logistics

This role is based in Cupertino, where we work in person. We believe the best ideas come from being in the same room. We offer generous health, dental, and vision coverage, paid parental leave, and relocation support.

Don’t meet every single qualification? That’s okay. We care more about your trajectory than checking every box. If the role excites you and the mission resonates, we’d love to hear from you.

Vacancy posted 18 hours ago
