Software Engineer, Inference - Performance Optimization

AI Chopping Block, Inc.

About the Team

Our team analyzes inference stack performance across the application, model, and fleet layers to identify bottlenecks and drive faster, cheaper inference. We combine systems profiling, benchmarking, and analysis to understand where time and cost are spent, then turn that understanding into performance optimizations and models that project performance and capacity needs for future launches.

About the Role

In this role, you will model inference performance across application, model, and fleet layers with higher fidelity. You will build cost‑to‑serve estimates from microbenchmarks and create tools that help cross‑functional teams reason about latency, capacity, utilization, and cost tradeoffs.

In this role, you will:

- Build and refine performance models that translate microbenchmark results into cost‑to‑serve estimates.
- Analyze inference workloads end to end across applications, models, and fleet infrastructure.
- Enhance tooling to identify bottlenecks across layers for latency and throughput.
- Partner with other teams to turn performance insights into concrete improvements and project how future changes affect inference.

You might thrive in this role if you:

- Enjoy reasoning from first principles about distributed systems, model inference, and hardware efficiency.
- Are comfortable working across abstraction layers, from application behavior to kernels, accelerators, networking, and fleet scheduling.
- Have deep expertise with performance profiling, benchmarking, analysis, and optimization.
- Enjoy collaborating with engineering and research teams to improve real production systems.

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general‑purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products.

We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic. We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

Vacancy posted 1 day ago

