Engineer, Inference & Model serving
$220k - $320k
techire ai
Job Description

ML Model Serving Engineer

Want to build the layer that actually makes AI usable in real time? You'll join a team focused on inference, where performance is the product. This is about delivering low-latency, high-throughput systems across LLMs, speech, and vision models running in production, not offline experiments.

They're building real-time AI systems that need to respond instantly, reliably, and at scale. That means solving hard problems around batching, GPU efficiency, memory constraints, and system-level bottlenecks that most teams never fully crack.

You'll sit at the core of the platform, working across model serving, infrastructure, and performance optimisation. A big part of the role is pushing current tooling beyond its limits, extending frameworks, profiling bottlenecks, and designing systems that hold up under real-world load.

This is not about training models. It's about making them fast, efficient, and production-ready.

What you'll work on:
- Building high-performance serving systems for LLM, speech, and vision models
- Scaling inference to production workloads with strict latency requirements
- Optimising GPU utilisation and execution efficiency
- Implementing techniques like continuous batching, KV cache optimisation, speculative decoding, and prefill/decode separation
- Improving frameworks such as vLLM, TensorRT-LLM, Triton, and SGLang
- Profiling and debugging performance across GPU, memory, and system layers
What you'll need:
- Strong experience with ML inference or model serving systems
- Deep understanding of latency and throughput optimisation in production
- Solid Python and PyTorch skills, plus a systems or performance engineering mindset
- Familiarity with distributed systems and production infrastructure
Exposure to CUDA, GPU profiling tools, or systems like Kubernetes and Ray is useful, but the key is knowing how to make models run efficiently at scale.

You'll join a highly technical team with experience across major AI labs and big tech. The environment is pragmatic, focused on solving real performance problems rather than abstract research. There's real ownership here. You'll help define how next-generation AI systems are served.

Package:
$220,000 - $320,000 base + equity
San Francisco, onsite 3 days per week

If you're interested in working on the part of AI that actually determines whether it works in the real world, this is worth exploring. All applicants will receive a response.
Vacancy posted 5 days ago


