Member of Technical Staff - Model Optimization and Inference (Experienced)
$250k - $350k
Nuance Labs, Inc.
About Nuance Labs
Nuance Labs is building photorealistic, real-time AI avatars with emotional intelligence: a full-duplex audiovisual system that can listen, speak, react, interrupt, and respond like a real person. We're a Series A company ($60M raised) backed by Lightspeed, Accel, South Park Commons, NVentures, and Define Ventures, with PhDs from MIT, UW, Oxford, CMU, and Johns Hopkins, and industry experience from Apple, Meta, Amazon AGI, and Discord. The team is small, the work is real, and the problems are unsolved.
How Nuance Differentiates
Most conversational AI avatars today are hacks - a face slapped on a speech-to-speech pipeline, stuck in the uncanny valley: emotionless, mechanical, one-turn-at-a-time. Current systems take 2-5 seconds to respond; natural conversation requires sub-500ms. That's a 10x improvement, and it demands rethinking the entire stack. That rethinking starts with full-duplex: an AI that listens and speaks simultaneously, perceives emotion in real time, and responds with a face that actually reflects it. It's an extremely hard problem, and we're developing foundation models designed for it from the ground up.
About the Role
We can train a great model. The next problem is making it fast enough to actually use in a real-time conversation - and that gap is enormous. A model that responds in 3 seconds is a demo. A model that responds in under 500ms is a product.
We're looking for someone who specializes in taking trained models and squeezing every last millisecond out of them. You understand the full stack from model weights to serving infrastructure - quantization, KV cache optimization, kernel-level acceleration, batching strategies - and you know which lever to pull for which problem. You've worked with vLLM, SGLang, or similar frameworks at scale and have strong opinions about where they fall short.
This posting is aimed at experienced engineers and researchers who've operated at a senior to senior-staff level at big tech, a leading AI lab, or a high-traffic inference team. Everyone at Nuance is MTS - we don't run title ladders - but we're hiring people who have already done this work at scale.
Our stack is more complex than a standard LLM deployment: we're serving a full-duplex multimodal system that must satisfy strict real-time latency constraints. There's a lot of unsolved optimization work here, and we need someone who finds that genuinely exciting.
What You'll Do
- Own end-to-end inference optimization across our model stack - LLMs, audio models, and diffusion-based components
- Implement and tune KV cache strategies for long-context conversations, including eviction policies, compression, and memory-efficient attention
- Evaluate, deploy, and extend inference serving frameworks (vLLM, SGLang, TensorRT-LLM, etc.) for our specific workloads
- Profile and benchmark end-to-end latency and throughput; identify and systematically eliminate bottlenecks
- Build internal tooling that makes optimization work faster and more rigorous - profiling viewers, end-to-end inference test harnesses, and other infrastructure that helps the team move quickly
- Accelerate diffusion model inference - consistency models, step distillation, caching strategies, and custom kernel optimizations
- Apply and develop quantization techniques (INT8, INT4, GPTQ, AWQ, and beyond) to reduce memory footprint and increase throughput without meaningfully degrading quality
- Work closely with research and infrastructure to ensure new models ship with optimized serving from day one
What We're Looking For
- Significant hands-on experience with LLM inference optimization - you've shipped work on KV caching, memory layout, attention kernels, or batching strategies in a production or high-traffic research context
- Proven proficiency with inference serving frameworks - vLLM, SGLang, TensorRT-LLM, or similar - including going well beyond default configurations and adapting them to non-standard workloads
- Experience optimizing diffusion model inference (latency reduction, step distillation, caching, or kernel-level work)
- Strong Python and PyTorch skills; comfort reading and writing CUDA or Triton kernels is a significant plus
- A systematic approach to profiling and optimization - you measure first, then optimize
- Familiarity with speculative decoding or other inference-time acceleration techniques
- Hands-on experience with post-training quantization (GPTQ, AWQ, or similar) and a clear sense of quality/performance tradeoffs
- Familiarity with multimodal or streaming inference architectures
- Experience deploying real-time AI systems with hard latency SLAs
- Prior work at an AI lab, inference startup, or on a high-traffic model serving platform
- Contributions to open-source inference frameworks
Logistics and Benefits
- Location: In-person in Seattle, five days a week - we believe in the compounding value of working shoulder-to-shoulder.
- Visa sponsorship: We sponsor visas (O-1, H-1B, green card) from day one.
- AI-native tooling: Do your best work with the best tools, including unlimited tokens.
- Health: HSA plan with ~$2,000 in annual company contributions - roughly 2x what most big tech companies put in.
- Time off: 15 days of PTO plus public holidays, and we close the office for a full week at year-end.
- Food: Lunch, drinks, and snacks on us every workday - the small thing that quietly makes the day better.
- Commuter benefits: We help cover the cost of getting to the office.
- 401(k): In the works.


