Kubernetes Infra Ops Engineer AI Fleet & Capacity
Baseten
A leading AI infrastructure company in San Francisco is seeking an Infrastructure Ops Engineer to manage the operational health of their GPU fleet. This role involves working closely with customer success and engineering teams to execute complex hardware lifecycles while ensuring the reliable performance of their platform. Candidates should possess strong skills in Kubernetes and a solid background in cloud infrastructure management. The position offers competitive pay, equity, and comprehensive benefits including medical coverage and generous PTO policies.
- Baseten is seeking a Capacity Operations Associate in... ...support their global AI infrastructure. This role... ...involves managing GPU fleet maintenance and... ...in Computer Science or Engineering with over 2 years of experience... ...-facing roles, strong Kubernetes knowledge, and a... Fleet
- ...s most dynamic AI companies, like... ...build the platform engineers turn to to ship... ...Infrastructure Ops Engineer at... ...that power our fleet. This role is designed... ...high-level capacity strategies are... ...hands-on with Kubernetes and cloud-native... ...SRE and Infra teams, executing... Fleet, Work experience placement, Work at office, Flexible hours
- ...A high-growth AI startup in San Francisco is seeking a Software Engineer (Infrastructure) to design and scale Kubernetes systems for a rapidly expanding platform. You will be responsible for leading technical deployments for enterprise clients and developing secure execution... Suggested
$180k - $250k
...A tech innovation company is looking for a hands-on engineer in San Francisco to manage a vast fleet of GPU servers. You will build systems for tracking server lifecycle, automate provisioning and health checks, and ensure OS-level security. The role requires 5+ years... Fleet
- ...infrastructure, or platform engineering, with at least 2... ...or tech-lead capacity), Deep, hands-on... ...experience with Kubernetes (preferably EKS),... ...Experience supporting AI/ML or data heavy... ...workloads (GPU fleets, vector stores, large... ...quickly without infra friction, Maintain... Fleet
- ...builds general-purpose AI for the physical... ...a heterogeneous fleet of GPU and TPU... ...work closely with ML Infra (training systems)... ...- Strong software engineering fundamentals - Experience... ...systems (SLURM, Kubernetes, GKE, K3S, or... ...- Experience with capacity planning and cloud... Fleet, Flexible hours
$293k - $385k
...The Infrastructure Engineering function sits within IT... ...infrastructure provisioned through Infra Terraform, ensuring... ...platforms, and fleet systems, driving durable... ...platform services, including Kubernetes and Docker-based... ...OpenAI OpenAI is an AI research and deployment... Fleet, Work at office
$293k
...the architectural and engineering backbone of OpenAI’s infrastructure... ...of cutting-edge AI models. Our work spans... ...architecture, fleet-level monitoring, and... ..., joined to the right Kubernetes control plane, registered... ...turning new SKUs into capacity that is usable by... Fleet, Full time, Work at office, Local area, Relocation package, Flexible hours
- ...Senior Site Reliability Engineer - AI Infrastructure Location: Global Remote... ...degradation, NCCL timeouts). Own capacity planning across heterogeneous GPU fleets optimized for training throughput... ...the syscall and hardware level. Kubernetes & Orchestration: Strong experience... Fleet, Full time, Remote work
$140.6k - $173.1k
...seasoned Staff Software Engineer in the North America... ...that focuses on building AI Platform to support the... ...Platform. Within this capacity, you will be responsible... ...strategic credit issuance to fleet organizations and their... ...such as Docker and Kubernetes) ~ Awareness of API... Fleet, Remote work, Flexible hours
$192k - $260k
...world’s best data and AI infrastructure platform... ...companies in the world. Our engineering teams build highly... ...software platforms. The fleet consists of millions of... ...and environments. Core Infra: Build the core infrastructure... .... We run thousands of Kubernetes clusters across all... Fleet, Work at office, Local area, Worldwide, Flexible hours
- ...The Team Platform Engineering is the department within SRE... ...our multi-cloud-provider Kubernetes infrastructure,... ...and alerting systems. The Fleet Management team provides... ...processes ("allergic to ops work") We are a small team... ...redefined the database for the AI era, enabling... Fleet, Work at office, Local area, Remote work, Worldwide, Flexible hours
- Baseten is seeking a Capacity Operations Associate to support their global AI infrastructure in San Francisco... ...and infrastructure engineering, collaborating closely... ...include: Managing GPU fleet maintenance Coordinating... ...engineering. Strong Kubernetes knowledge, attention to... Fleet
- ...the world's most dynamic AI companies, like Cursor,... ...build the platform engineers turn to ship AI products... ...THE ROLE We’re hiring a Capacity and Infrastructure Analytics... ...across Baseten’s fleet. You’ll create reliable... ...Accounting, Product, and Ops stakeholders. Strong... Fleet, Flexible hours
- TRM Labs is looking for a Senior or Staff ML Systems Engineer to focus on building and scaling the technical infrastructure for AI/ML systems in San Francisco. This position involves developing reusable CI/CD workflows and automating model versioning to ensure compliance...
$200k - $260k
...from zero — AWS, Kubernetes, Golang, and... ...that the rest of engineering will build on... ...years. Founding infra seat with the architectural... ...building the AI-native... ....ts service fleet (API gateway, GraphQL... ...Reliability & ops — Backups, DR,... ...for tomorrow. Capacity planning. Cost... Fleet, Full time, Work at office, Local area
$202.5k - $247.5k
...sharing localhost or running AI workloads in production... ..., AI inference, device fleets, and site‑to‑site... ...worth your time. About the Infra Platform Team The Infra... ...the systems ngrok engineers rely on to build, deploy... ...Go, PostgreSQL, gRPC, Kubernetes, Terraform, Protobuf, nix... Fleet, Permanent employment, Full time, Work at office, Local area, Remote work, Home office, Flexible hours
$145k - $195k
...A tech-driven company in San Francisco seeks a Software Engineer for the Infra team. The ideal candidate will have 3+ years experience and a strong focus on Python and CI/CD processes. Responsibilities include developing testing strategies, enhancing developer productivity...
- ...A leading AI research company in San Francisco is seeking engineers to operate next-gen compute clusters. The role requires scaling Kubernetes, automating infrastructure, and ensuring system reliability. Ideal candidates have strong Kubernetes and scripting skills with...
$200k - $260k
...Inc. is seeking a founding infrastructure engineer based in San Francisco to rebuild their... ...core systems from scratch using AWS and Kubernetes. The ideal candidate will have extensive... ...revolutionizing the physical goods supply chain with AI-native solutions... Full time
- ...class Site Reliability Engineer to ensure the... ...scalability of our AI infrastructure platform... ...tuning, incident ops, infrastructure health... ...the founders, the infra team, and the dev... ...operations, debugging, capacity planning, and... ...networking) Experience with Kubernetes or similar...
- ...AI Systems Engineer - Codex Core Agents About The Team The Codex Core Agents team builds... ...envelope around tokens, latency, cost, capacity, and quality. The harness is open source... ...behavior, inference/runtime stack, GPU fleet, and product surface. You'll work with... Fleet
- ...America's manufacturing base. Our AI-powered robots automate food prep... ...is looking for a Senior Software Engineer, Robotics Platform, to help us scale our fleet of robots. You will make a large... ...people in a tech lead or similar capacity. Chef Robotics is solving one of... Fleet
- ...world At Bedrock, we’re moving AI out of the lab and into the... ...construction veterans and world-class engineers to solve physical-world... ...We’re building out our first fleet of retrofitted autonomous construction... ...to week, supporting relevant ops as needed, with potential for... Fleet, Temporary work, Work at office, Remote work, Flexible hours, Night shift, Weekend work
- ...About Us Most AI is frozen in place - it doesn't adapt... ...about both. Researchers and ML engineers will hand you workloads that... ...cost across heterogeneous GPU fleets. Batching, scheduling, KV cache... ...by. ~ Experience operating Kubernetes-based infrastructure,... Fleet, Flexible hours
- ...individual to build a high-performance macOS virtualization platform. This role involves managing the VM lifecycle and integrating with the fleet scheduler to optimize performance on Apple Silicon. The ideal candidate is curious and has hands-on experience in virtualization... Fleet, Flexible hours
$179k - $218k
...Senior Staff Data Center Operations Engineer, GPU Hardware Architecture... ...the only vertically integrated AI infrastructure company built from... ...AI/ML methodologies to analyze fleet-wide telemetry (power draws,... ...diagnostic tooling that allows Site Ops to identify NVLink flapping,... Fleet, Temporary work
$125k - $195k
...exceptional, hands-on engineers to make this happen. Mechanical... ...philosophy towards infra is minimal,... ...docker, cloud services, or kubernetes. Instead, there is a lot... ...deploy and manage our fleet of on-prem servers,... ...upon the applicant’s capacity to serve in compliance... Fleet, Work at office, Visa sponsorship, Night shift
- AI Systems Engineer - Codex Core Agents Location San Francisco Employment Type Full time Department Applied AI Compensation 230K-385... ...agent stack, from backend systems to inference, GPUs, and fleet capacity. Work closely with research to make the harness trainable... Fleet, Full time, Work at office, Local area, Relocation package, Flexible hours
$165k - $315k
...build the foundation for agent engineering in the real world, helping... ...prototypes to production-ready AI agents that teams can rely on... ..., Evaluation, Deployment, Fleet, and Sandboxes), our open source... ...), containers, and basic Kubernetes concepts Have shipped and operated... Fleet, Work at office, Flexible hours

