Software Engineer, Inference - Performance Optimization

AI Chopping Block, Inc.

About the Team

Our team analyzes inference stack performance across the application, model, and fleet layers to identify bottlenecks and drive faster, cheaper inference. We combine systems profiling, benchmarking, and analysis to understand where time and cost are spent, then turn that understanding into performance optimizations and models that project performance and capacity needs for future launches.

About the Role

In this role, you will model inference performance across application, model, and fleet layers with higher fidelity. You will build cost‑to‑serve estimates from microbenchmarks and create tools that help cross‑functional teams reason about latency, capacity, utilization, and cost tradeoffs.

In this role, you will:
  • Build and refine performance models that translate microbenchmark results into cost‑to‑serve estimates.
  • Analyze inference workloads end to end across applications, models, and fleet infrastructure.
  • Enhance tooling to identify bottlenecks across layers for latency and throughput.
  • Partner with other teams to turn performance insights into concrete improvements and project how future changes affect inference.

You might thrive in this role if you:
  • Enjoy reasoning from first principles about distributed systems, model inference, and hardware efficiency.
  • Are comfortable working across abstraction layers, from application behavior to kernels, accelerators, networking, and fleet scheduling.
  • Have deep expertise with performance profiling, benchmarking, analysis, and optimization.
  • Enjoy collaborating with engineering and research teams to improve real production systems.

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general‑purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products.

We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic. We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

Vacancy posted 1 day ago