AI Platform Engineer, Training and Inference
Medium
Saviynt's AI-powered identity platform manages and governs human and non-human access to all of an organization's applications, data, and business processes. Customers trust Saviynt to safeguard their digital assets, drive operational efficiency, and reduce compliance costs. Built for the AI age, Saviynt is today helping organizations safely accelerate their deployment and usage of AI. Saviynt is recognized as the leader in identity security, with solutions that protect and empower the world's leading brands, Fortune 500 companies and government institutions. For more information, please visit

The AI Platform team is building the compute layer that trains, evaluates, and serves every AI model at Saviynt. We need an ML Platform Engineer to own distributed training on Ray + H100s, the multi-engine LLM inference mesh (vLLM, SGLang, NVIDIA Triton), and the full model promotion lifecycle, from shadow mode through canary rollout to GA.

The AI Platform team's mission is to build a secure, scalable, product-agnostic AI foundation that enables Saviynt's identity products to deliver measurable AI-powered outcomes. Training & Inference is the engine: it turns data into deployed models that make Saviynt's products smarter.

What You Will Be Doing

- Own the Ray ecosystem end-to-end: manage KubeRay on GKE, tune Ray Core/Task/Actor scheduling, operate the Plasma distributed object store, and configure RayData for GPU-direct streaming from GCS/S3.
- Operate distributed training with Ray Train: configure TorchTrainer + DDP/NCCL for multi-node H100 clusters, manage checkpoint lifecycle, implement spot-preemption recovery, and integrate warm-start fine-tuning for retrain pipelines.
- Build and operate the LLM inference mesh with Ray Serve: compose vLLM (PagedAttention), SGLang (RadixAttention), and NVIDIA Triton (TensorRT/ONNX) as a unified deployment graph with Plasma zero-copy memory sharing.
- Optimise inference performance: configure fractional GPU allocation, enable continuous batching, implement per-engine autoscaling based on request queue depth, and tune KV-cache block sizes.
- Design and operate the model routing layer: capability-based, version-based, and tenant-based routing with cost-aware fallback between self-hosted SLMs and cloud LLMs.
- Build RL training infrastructure: define Flyte workflows for RL pipelines (rollout, reward shaping, policy update, evaluation), integrate Ray RLlib or custom PPO/GRPO loops with Ray Train, and manage replay buffer persistence on GCS.
- Operate the full model promotion lifecycle: quality gate, integration tests, load tests (k6), shadow mode, A/B gate, then canary (10% to 100%) with golden-signal auto-rollback.
- Operate the retrain pipeline: drift detection triggers, warm-start retraining, relative quality gates (V2 ≥ V1 - 2%), and an automated Flyte DAG through to canary.
- Integrate RAG retrieval into the inference mesh: vector similarity search, context assembly, and prompt construction before LLM inference.

What You Bring

- Experience in ML engineering with time in an ML platform or MLOps role.
- Production Ray depth: Ray Train, Serve, Core, and Data. You have debugged real production failures including NCCL timeouts, Plasma OOM, and Serve autoscaling lag.
- LLM serving engines: hands-on with vLLM, SGLang, or NVIDIA Triton, including PagedAttention, prefix caching, and continuous batching tuned for latency/throughput targets.
- Distributed training: DDP, FSDP, NCCL collectives, gradient checkpointing, and mixed precision (BF16/FP8).
- RL working knowledge: PPO, policy gradient, or RLHF, with the ability to translate an algorithm into distributed compute primitives.
- Model lifecycle operations: MLflow registry, shadow/A/B/canary patterns, and auto-rollback on golden-signal degradation.
- Vector databases: pgvector or Qdrant, covering ANN index strategies, embedding upsert, and query latency tuning under inference load.
- Strong Python and PyTorch; Flyte or equivalent ML orchestrator.
- Quantization (nice to have): INT8/INT4/FP8 post-training quantization (GPTQ, AWQ, or bitsandbytes).
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical or military experience.

We offer you a competitive total rewards package, learning, and tremendous opportunities to grow and advance in your career. At Saviynt, it is not typical for an individual to be hired at or near the top of the range for their role, and final compensation decisions depend on many factors including, but not limited to, location; skill sets; experience and training; licensure and certifications; and other relevant business and organizational needs. You may also be eligible to participate in a Saviynt discretionary bonus plan, subject to the rules governing the program, whereby an award, if any, depends on various factors, including, without limitation, individual and organizational performance.
Saviynt, located in San Francisco, is seeking an AI Platform Engineer to manage and optimize the training and inference of AI models. You will lead efforts in operating the Ray ecosystem and distributed training on advanced GPU clusters. The ideal candidate has a solid...