AI Platform Engineer, Training and Inference

Medium

AI Platform Engineer – Training & Inference

Saviynt's AI-powered identity platform manages and governs human and non-human access to all of an organization's applications, data, and business processes. Customers trust Saviynt to safeguard their digital assets, drive operational efficiency, and reduce compliance costs. Built for the AI age, Saviynt is today helping organizations safely accelerate their deployment and usage of AI. Saviynt is recognized as the leader in identity security, with solutions that protect and empower the world's leading brands, Fortune 500 companies and government institutions. For more information, please visit

The AI Platform team is building the compute layer that trains, evaluates, and serves every AI model at Saviynt. We need an ML Platform Engineer to own distributed training on Ray + H100s, the multi-engine LLM inference mesh (vLLM, SGLang, NVIDIA Triton), and the full model promotion lifecycle, from shadow mode through canary rollout to GA.

The AI Platform team's mission is to build a secure, scalable, product-agnostic AI foundation that enables Saviynt's identity products to deliver measurable AI-powered outcomes. Training & Inference is the engine: it turns data into deployed models that make Saviynt's products smarter.

What You Will Be Doing

Own the Ray ecosystem end-to-end: manage KubeRay on GKE, tune Ray Core/Task/Actor scheduling, operate the Plasma distributed object store, and configure Ray Data for GPU-direct streaming from GCS/S3.
Operate distributed training with Ray Train: configure TorchTrainer + DDP/NCCL for multi-node H100 clusters, manage checkpoint lifecycle, implement spot-preemption recovery, and integrate warm-start fine-tuning for retrain pipelines.
Build and operate the LLM inference mesh with Ray Serve: compose vLLM (PagedAttention), SGLang (RadixAttention), and NVIDIA Triton (TensorRT/ONNX) as a unified deployment graph with Plasma zero-copy memory sharing.
Optimise inference performance: configure fractional GPU allocation, enable continuous batching, implement per-engine autoscaling based on request queue depth, and tune KV-cache block sizes.
Design and operate the model routing layer: capability-based, version-based, and tenant-based routing with cost-aware fallback between self-hosted SLMs and cloud LLMs.
Build RL training infrastructure: define Flyte workflows for RL pipelines (rollout, reward shaping, policy update, evaluation), integrate Ray RLlib or custom PPO/GRPO loops with Ray Train, and manage replay buffer persistence on GCS.
Operate the full model promotion lifecycle: quality gate, integration tests, load tests (k6), shadow mode, A/B gate, then canary (10% to 100%), with golden-signal auto-rollback.
Operate the retrain pipeline: drift detection triggers, warm-start retraining, relative quality gates (V2 ≥ V1 - 2%), and an automated Flyte DAG through to canary.
Integrate RAG retrieval into the inference mesh: vector similarity search, context assembly, and prompt construction before LLM inference.

What You Bring

ML engineering experience, including time in an ML platform or MLOps role.
Production Ray depth: Ray Train, Serve, Core, and Data, with experience debugging real production failures including NCCL timeouts, Plasma OOM, and Serve autoscaling lag.
LLM serving engines: hands-on with vLLM, SGLang, or NVIDIA Triton, including PagedAttention, prefix caching, and continuous batching tuned for latency/throughput targets.
Distributed training: DDP, FSDP, NCCL collectives, gradient checkpointing, and mixed precision (BF16/FP8).
RL working knowledge: PPO, policy gradient, or RLHF, with the ability to translate an algorithm into distributed compute primitives.
Model lifecycle operations: MLflow registry, shadow/A/B/canary patterns, and auto-rollback on golden-signal degradation.
Vector databases: pgvector or Qdrant, covering ANN index strategies, embedding upsert, and query latency tuning under inference load.
Strong Python and PyTorch; Flyte or an equivalent ML orchestrator.
Quantization (nice to have): INT8/INT4/FP8 post-training quantization (GPTQ, AWQ, or bitsandbytes).
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical or military experience.

We offer a competitive total rewards package, learning, and tremendous opportunities to grow and advance in your career. At Saviynt, it is not typical for an individual to be hired at or near the top of the range for their role, and final compensation decisions depend on many factors including, but not limited to, location; skill sets; experience and training; licensure and certifications; and other relevant business and organizational needs. You may also be eligible to participate in a Saviynt discretionary bonus plan, subject to the rules governing the program, whereby an award, if any, depends on various factors, including, without limitation, individual and organizational performance.

Vacancy posted 10 hours ago

Similar jobs that could be interesting for you
Based on the AI Platform Engineer, Training and Inference in San Francisco, CA vacancy

Lead AI Engineer (FM Hosting, LLM Inference)
$197.3k - $225.1k
...Lead AI Engineer (FM Hosting, LLM Inference) Overview At Capital One, we are creating responsible and... ...millions of customers. Our AI models and platforms empower teams across Capital One to... ...including foundation model training, large language model inference, similarity...
Training
Full time
Part time
Local area
Capital One Financial Corp
San Francisco, CA
3 days ago
AI Infrastructure Engineer — Scalable Training & Inference
An innovative AI company is seeking a Software Engineer to develop infrastructure that supports AI training and inference workflows. This role requires strong object-oriented programming skills and a solid foundation in data structures and algorithms. The ideal candidate...
Training
SpreeAI
San Francisco, CA
2 days ago
ML Platform Engineer: Training & Inference Engine
Saviynt, located in San Francisco, is seeking an AI Platform Engineer to manage and optimize the training and inference of AI models. You will lead efforts in operating the Ray ecosystem and distributed training on advanced GPU clusters. The ideal candidate has a solid...
Training
Medium
San Francisco, CA
20 hours ago
Senior AI Platform Engineer
$151.8k - $265.35k
...content with ease. The AI Foundations team builds the core AI platform that powers creativity across... .... We’re looking for an engineer to help develop and... ...including model integration, inference services, data pipelines... ...inference pipelines for training, evaluation, fine-tuning...
Training
Full time
Temporary work
Local area
Worldwide
Adobe
San Francisco, CA
13 hours ago
Sr. Lead AI Engineer (Inference Optimization, FM hosting, AI Platform)
$229.9k - $262.4k
Sr. Lead AI Engineer (Inference Optimization, FM hosting, AI Platform) Overview: At Capital One, we are creating responsible and reliable AI systems, changing banking... ...AI software components including foundation model training, large language model inference, similarity search...
Training
Full time
Part time
Local area
Capital One
San Francisco, CA
1 day ago
AI Platform Engineer, Capabilities
...Brain Co. is an applied AI startup co-founded by Jared Kushner... ...Role: As a core backend engineer at Brain Co., you will build... ...system designs, and ship robust platforms that support real-world AI... ...artifact management, and automated training and evaluation pipelines....
Training
Remote work
Brainco
San Francisco, CA
4 days ago
Senior AI Platform Engineer
$180k - $200k
...CA - Hybrid (must be onsite 6 times per month) Title: Senior AI Platform Engineer Job Description We are building a next-generation Agentic AI... ...not limited to: the individual’s skill sets, experience and training; licensure and certification requirements; office location...
Training
Full time
Work at office
Local area
Vaco
San Francisco, CA
4 days ago
Distinguished AI Engineer (Agentic AI Platform)
$269.1k - $307.2k
...Distinguished AI Engineer (Agentic AI Platform) At Capital One, we are creating responsible and reliable... ...or technologies (e.g. LLM Inference, Similarity Search and VectorDBs, Guardrails... ...of-the-art techniques for optimizing training and inference software to improve hardware...
Training
Full time
Part time
Work at office
Local area
Capital One
San Francisco, CA
20 hours ago
Senior AI Infrastructure Engineer, Model Serving Platform
$216k - $270k
...As a Software Engineer on the ML Infrastructure team... ...will design and build platforms for scalable, reliable... ...LLM, or text-generation-inference. Compensation... ...relevant education or training. Scale employees in eligible... ...to develop reliable AI systems for the world'...
Training
Full time
Scale AI
San Francisco, CA
4 days ago
Senior Lead AI Engineer (Gen AI Platform Services, Agentic AI)
$229.9k - $262.4k
...Senior Lead AI Engineer (Gen AI Platform Services, Agentic AI) Overview: At Capital One, we are creating responsible and reliable... ...AI software components including foundation model training, large language model inference, similarity search, guardrails, model evaluation,...
Training
Full time
Part time
Local area
Capital One
San Francisco, CA
3 days ago
AI Infra Engineer: Scale ML Training & Inference
A leading AI technology firm in San Francisco is seeking an AI Infra Engineer to enhance their infrastructure. The successful candidate will design and maintain Kubernetes... ...clusters and manage Slurm for distributed training. Important skills include extensive experience...
Training
Perplexity
San Francisco, CA
3 days ago
AI Infrastructure Engineer Intern — Training & Inference
A leading AI fashion-tech company is seeking a Software Engineer Intern to focus on building infrastructure for AI systems. This role involves designing scalable models, developing APIs, and optimizing for performance and reliability. An ideal candidate will have a strong...
Training
Internship
Immediate start
SpreeAI
San Francisco, CA
2 days ago
Senior AI Data Platform Engineer
Shield AI, located in San Francisco, is seeking a Principal Engineer to lead the AI data platform efforts from training to deployment in diverse environments. This pivotal role involves scaling architecture across various autonomy programs and ensuring efficiency and reliability...
Training
jobs.frontdoordefense.com - Jobboard
San Francisco, CA
20 hours ago
Staff AI Platform Engineer
$178k - $267k
...intelligence and search built on proven AI, AlphaSense delivers insights... ...of content sets. Our platform is trusted by over 6,000... ...the Team Our diverse Product & Engineering team values innovation, collaboration... ...recruitment, hiring, training, advancement, and termination...
Training
Local area
BetterCloud
San Francisco, CA
3 days ago
Sr. Lead AI Engineer (GenAI Platform)
...responsible and reliable AI systems, changing banking... ...class applied science and engineering teams to deliver our... ...customers. Our AI models and platforms empower teams across Capital... ...foundation model training, large language model inference, similarity search, guardrails...
Training
Local area
Capital One National Association
San Francisco, CA
20 hours ago
Senior AI Platform Engineer
$192k - $264k
...intelligent underwriting engine that determines credit... ...--- we use data, AI, and machine learning to... ...continue to build our AI platform that will power our wholesale... ...a-service” ~ ML Model Training framework ~ Feature... ...model deployment, inference and monitoring ~ AI Agent...
Training
Work experience placement
Work at office
Local area
Remote work
2 days per week
Faire
San Francisco, CA
more than 2 months ago
Principal Engineer, AI And Data Platform Engineering (r4941)
$320k
Principal Engineer, AI And Data Platform Engineering (r4941) Own the AI data platform from training to deployment across on‑prem and cloud environments globally Location: San Francisco, California, United States Compensation: $320,000 - 490,000 USD / year Job Tags: Software...
Training
Full time
Temporary work
Part time
jobs.frontdoordefense.com - Jobboard
San Francisco, CA
20 hours ago
Distributed Systems Engineer, Data & Inference Platform
...About Us Most AI is frozen in place - it doesn't adapt... ...into useful intelligence - the inference services that serve LLMs at scale... ...both. Researchers and ML engineers will hand you workloads that... ...and curate the datasets behind training and evaluation. The...
Training
Flexible hours
Adaption
San Francisco, CA
7 days ago
Senior AI Inference Data Plane Engineer Remote
$167.2k - $209k
A leading cloud service provider is seeking a Senior Engineer 2 for their AI Inference Data Plane team. This remote role focuses on designing and developing high-scale, resilient data plane services that enhance AI-driven applications. The ideal candidate will have strong...
Remote work
DigitalOcean
San Francisco, CA
7 days ago
AI Inference Performance Engineer
Fathom is seeking a Model Performance Engineer in San Francisco to optimize the speed, cost, and reliability of its model inference stack while building fine-tuning infrastructure. The ideal candidate will have extensive experience with LLM frameworks, quantization techniques...
Fathom
San Francisco, CA
4 days ago
Senior AI Inference Engineer - GPU, Rust & CUDA
$220k
Perplexity is looking for an engineer to join their team in San Francisco. You will work on building and operating the inference engine, supporting new models, migrating GPU kernels, and developing a Rust-based serving runtime. The ideal candidate has 3+ years of experience...
Perplexity
San Francisco, CA
1 day ago
AI Platform Engineer, Infrastructure
...About Brain Co. Brain Co. is an applied AI startup co-founded by Jared Kushner and... ...workloads. Support high-performance inference, data pipelines, and large-scale backend... ...technical leaders on architecture and mentor engineers. You Might Be a Great Fit If You......
Brainco
San Francisco, CA
4 days ago
AI Platform Engineer, Agentic Engineering
...Brain Co. is an applied AI startup co-founded by Jared Kushner... ...across Brain Co. For every engineer, operator, and business team... ...critical industries. This is a platform role at the center of the company... ...infrastructure: gateways, inference systems, prompt routing, cost...
Brainco
San Francisco, CA
20 hours ago
Member of Technical Staff (AI Inference Engineer)
...Inference Engine Engineer We build and run the inference engine behind every Perplexity query and deploy dozens of model architectures at scale with tight latency and cost budgets. Our stack is Rust, Python, CUDA, and CuTe DSL - and we need another engineer to join...
Perplexity AI
San Francisco, CA
4 days ago
Applied AI Inference Engineer
ABOUT BASETEN Baseten powers mission‑critical inference for the world's most dynamic AI companies, like Cursor, Notion, OpenEvidence, Abridge, Clay... ..., and Conviction. Join us and help build the platform engineers turn to to ship AI products. THE ROLE As an Applied...
Work experience placement
Flexible hours
Baseten
San Francisco, CA
2 days ago
Staff AI Platform Engineer
...Staff AI Platform Engineer Laurel is on a mission to return time. As the leading AI Time platform for professional services firms, we're... ...platform to be worldclass. We already process millions of inferences per day, but to keep up with our growth, we need a platform...
Work at office
Remote work
Visa sponsorship
2 days per week
Laurel Property Services
San Francisco, CA
4 days ago
AI Engineer
$180k - $250k
...We're hiring a full-time AI Engineer to own the prompts, agents, evals... ...intersection of product and platform: you decide what the AI... ...systems Experience with post-training or fine-tuning Experience... ...frameworks for batch inference, eval pipelines, or scaling...
Training
Full time
Work at office
Remote work
Relocation
Fluency Corp
San Francisco, CA
20 hours ago
AI Engineer LLM Infra
...interact with the web by building AI agents that can reliably do... ...to be agent-first, from training our own models to generative... ...) Scale infra for agentic inference (throughput and latency of perception... ...Work closely with product engineers to translate cutting-edge AI...
Training
Work at office
Relocation
Visa sponsorship
Yutori
San Francisco, CA
19 days ago
AI Engineer, Multimodal LLMs
...Meet Eloquent AI At Eloquent AI, we're building the next... ...alongside world-class talent in AI, engineering, and product as we redefine... ...solutions Refine training paradigms for real-world applications... ..., including fine-tuning and inference optimization. ~ Familiarity...
Training
Eloquent AI
San Francisco, CA
20 hours ago
Senior AI Infrastructure Engineer - Training Platform
$216k - $270k
...As a Software Engineer on the Machine Learning Infrastructure team, you will build the "... .... You will architect a high-performance training platform that handles the immense complexity of multi... ...raw compute into breakthrough AI. You will: Architect and scale a multi...
Training
Full time
Scale AI
San Francisco, CA
1 day ago
