Kubernetes Infra Ops Engineer — AI Fleet & Capacity
Baseten
A leading AI infrastructure company in San Francisco is seeking an Infrastructure Ops Engineer to manage the operational health of their GPU fleet. This role involves working closely with customer success and engineering teams to execute complex hardware lifecycles while ensuring the reliable performance of their platform. Candidates should possess strong skills in Kubernetes and a solid background in cloud infrastructure management. The position offers competitive pay, equity, and comprehensive benefits including medical coverage and generous PTO policies.
- Baseten is seeking a Capacity Operations Associate in... ...support their global AI infrastructure. This role... ...involves managing GPU fleet maintenance and... ...in Computer Science or Engineering with over 2 years of experience... ...-facing roles, strong Kubernetes knowledge, and a...
Fleet
- ...s most dynamic AI companies, like... ...build the platform engineers turn to to ship... ...Infrastructure Ops Engineer at... ...that power our fleet This role is... ...that high-level capacity strategies are... ...hands-on with Kubernetes and cloud-native... ...between SRE and Infra teams,...
Fleet, Work experience placement, Work at office, Flexible hours
$200k - $300k
...A technology infrastructure company is seeking a System Engineer, GPU Fleet, to manage and optimize GPU compute infrastructure for AI/ML workloads. You will ensure high availability and performance of the GPU server fleet through monitoring, troubleshooting, and automation...
Fleet
- A high-growth AI startup in San Francisco is seeking a Software Engineer (Infrastructure) to design and scale Kubernetes systems for a rapidly expanding platform. You will be responsible for leading technical deployments for enterprise clients and developing secure execution...
- ...builds general-purpose AI for the physical... ...a heterogeneous fleet of GPU and TPU... ...work closely with ML Infra (training systems)... ...Strong software engineering fundamentals Experience... ...systems (SLURM, Kubernetes, GKE, K3S, or... ...Experience with capacity planning and cloud...
Fleet
$180k - $250k
A tech innovation company is looking for a hands-on engineer in San Francisco to manage a vast fleet of GPU servers. You will build systems for tracking server lifecycle, automate provisioning and health checks, and ensure OS-level security. The role requires 5+ years...
Fleet
$293k - $385k
...The Infrastructure Engineering function sits within IT... ...infrastructure provisioned through Infra Terraform, ensuring... ...platforms, and fleet systems, driving durable... ...platform services, including Kubernetes and Docker-based... ...OpenAI OpenAI is an AI research and deployment...
Fleet, Work at office
$140.6k - $173.1k
...seasoned Staff Software Engineer in the North America... ...that focuses on building AI Platform to support the... ...Platform. Within this capacity, you will be responsible... ...strategic credit issuance to fleet organizations and their... ...such as Docker and Kubernetes) ~ Awareness of API...
Fleet, Remote work, Flexible hours
- Baseten is seeking a Capacity Operations Associate to support their global AI infrastructure in San Francisco... ...and infrastructure engineering, collaborating closely... ...include: Managing GPU fleet maintenance Coordinating... ...engineering. Strong Kubernetes knowledge, attention to...
Fleet
- Senior Site Reliability Engineer - AI Infrastructure Location: Global Remote /... ...degradation, NCCL timeouts). Own capacity planning across heterogeneous GPU fleets optimized for training... ...the syscall and hardware level. Kubernetes & Orchestration: Strong experience...
Fleet, Full time, Remote work
$293k
...the architectural and engineering backbone of OpenAI’s infrastructure... ...of cutting-edge AI models. Our work spans... ...architecture, fleet-level monitoring, and... ..., joined to the right Kubernetes control plane, registered... ...turning new SKUs into capacity that is usable by...
Fleet, Full time, Work at office, Local area, Relocation package, Flexible hours
- TRM Labs is looking for a Senior or Staff ML Systems Engineer to focus on building and scaling the technical infrastructure for AI/ML systems in San Francisco. This position involves developing reusable CI/CD workflows and automating model versioning to ensure compliance...
$202.5k - $247.5k
...sharing localhost or running AI workloads in production... ..., AI inference, device fleets, and site‑to‑site... ...worth your time. About the Infra Platform Team The Infra... ...the systems ngrok engineers rely on to build, deploy... ...Go, PostgreSQL, gRPC, Kubernetes, Terraform, Protobuf, nix...
Fleet, Permanent employment, Full time, Work at office, Local area, Remote work, Home office, Flexible hours
- ...the world's most dynamic AI companies, like Cursor,... ...build the platform engineers turn to ship AI products... ...THE ROLE We’re hiring a Capacity and Infrastructure Analytics... ...across Baseten’s fleet. You’ll create reliable... ...Accounting, Product, and Ops stakeholders. Strong...
Fleet, Flexible hours
$200k - $260k
...from zero — AWS, Kubernetes, Golang, and... ...that the rest of engineering will build on... ...years. Founding infra seat with the architectural... ...building the AI-native... ....ts service fleet (API gateway, GraphQL... ...Reliability & ops — Backups, DR,... ...for tomorrow. Capacity planning. Cost...
Fleet, Full time, Work at office, Local area
- ...AI Systems Engineer - Codex Core Agents About The Team The Codex Core Agents team builds... ...envelope around tokens, latency, cost, capacity, and quality. The harness is open source... ...behavior, inference/runtime stack, GPU fleet, and product surface. You'll work with...
Fleet
$145k - $195k
...A tech-driven company in San Francisco seeks a Software Engineer for the Infra team. The ideal candidate will have 3+ years experience and a strong focus on Python and CI/CD processes. Responsibilities include developing testing strategies, enhancing developer productivity...
- ...America's manufacturing base. Our AI-powered robots automate food prep... ...is looking for a Senior Software Engineer, Robotics Platform, to help us scale our fleet of robots. You will make a large... ...people in a tech lead or similar capacity. Chef Robotics is solving one of...
Fleet
- ...About Us Most AI is frozen in place - it doesn't adapt... ...about both. Researchers and ML engineers will hand you workloads that... ...cost across heterogeneous GPU fleets. Batching, scheduling, KV cache... ...by. ~ Experience operating Kubernetes-based infrastructure,...
Fleet, Flexible hours
- A leading AI research company in San Francisco is seeking engineers to operate next-gen compute clusters. The role requires scaling Kubernetes, automating infrastructure, and ensuring system reliability. Ideal candidates have strong Kubernetes and scripting skills with...
$200k - $260k
...Inc. is seeking a founding infrastructure engineer based in San Francisco to rebuild their... ...core systems from scratch using AWS and Kubernetes. The ideal candidate will have extensive... ...revolutionizing the physical goods supply chain with AI-native solutions...
Full time
$179k - $218k
...Senior Staff Data Center Operations Engineer, GPU Hardware Architecture... ...the only vertically integrated AI infrastructure company built from... ...AI/ML methodologies to analyze fleet-wide telemetry (power draws,... ...diagnostic tooling that allows Site Ops to identify NVLink flapping,...
Fleet, Temporary work
- ...build the foundation for agent engineering in the real world, helping... ...prototypes to production-ready AI agents that teams can rely on... ..., Evaluation, Deployment, Fleet, and Sandboxes), our open source... ...), containers, and basic Kubernetes concepts Have shipped and operated...
Fleet, Work at office, Flexible hours
- ...class Site Reliability Engineer to ensure the... ...scalability of our AI infrastructure platform... ...tuning, incident ops, infrastructure health... ...the founders, the infra team, and the dev... ...operations, debugging, capacity planning, and... ...) Experience with Kubernetes or similar orchestrators...
- ...world At Bedrock, we’re moving AI out of the lab and into the... ...construction veterans and world-class engineers to solve physical-world... ...We’re building out our first fleet of retrofitted autonomous construction... ...to week, supporting relevant ops as needed, with potential for...
Fleet, Temporary work, Work at office, Remote work, Flexible hours, Night shift, Weekend work
- AI Systems Engineer - Codex Core Agents Location San Francisco Employment Type Full time Department Applied AI Compensation 230K-385... ...agent stack, from backend systems to inference, GPUs, and fleet capacity. Work closely with research to make the harness trainable...
Fleet, Full time, Work at office, Local area, Relocation package, Flexible hours
- ...individual to build a high-performance macOS virtualization platform. This role involves managing the VM lifecycle and integrating with the fleet scheduler to optimize performance on Apple Silicon. The ideal candidate is curious and has hands-on experience in virtualization...
Fleet, Flexible hours
$125k - $195k
...exceptional, hands-on engineers to make this happen. Mechanical... ...philosophy towards infra is minimal,... ...docker, cloud services, or kubernetes. Instead, there is a lot... ...deploy and manage our fleet of on-prem servers,... ...upon the applicant’s capacity to serve in compliance...
Fleet, Work at office, Visa sponsorship, Night shift
$114k - $144k
Who we are Pronto AI is a global leader in commercializing autonomous... ...across our autonomous truck fleet. This role focuses on... ...immediate problems, but building our capacity to diagnose and resolve issues... ...closely with experienced engineers and industry operators. What...
Fleet, Full time, Work experience placement, Internship, Work at office, Immediate start, Worldwide, Flexible hours
- ...company in San Francisco is seeking a Backend Software Engineer to own core parts of the software stack for robotic arms. Responsibilities include architecting systems for AI models, maintaining tooling for fleet management, and supporting deployment on real hardware....
Fleet

