Kubernetes Infra Ops Engineer — AI Fleet & Capacity
Baseten
A leading AI infrastructure company in San Francisco is seeking an Infrastructure Ops Engineer to manage the operational health of their GPU fleet. This role involves working closely with customer success and engineering teams to execute complex hardware lifecycles while ensuring the reliable performance of their platform. Candidates should possess strong skills in Kubernetes and a solid background in cloud infrastructure management. The position offers competitive pay, equity, and comprehensive benefits including medical coverage and generous PTO policies.
- ...Consensus in San Francisco is seeking an Infrastructure Ops Engineer to manage daily operations of our GPU fleet and maintain infrastructure excellence. This hands-... ...teams to ensure timely fulfillment of customer capacity requests. Candidates should have a Bachelor's or... [Fleet]
- A high-growth AI startup in San Francisco is seeking a Software Engineer (Infrastructure) to design and scale Kubernetes systems for a rapidly expanding platform. You will be responsible for leading technical deployments for enterprise clients and developing secure execution... [Suggested]
$150k - $300k
...We are looking for an AI Cloud Infra Engineer to join our infrastructure... ...relies heavily on Kubernetes (K8s), Terraform, and... ...managing GPU server capacity, observability, and... ...provider APIs to manage our fleet of GPU workers across... ...: dev, arts, prod, ops, etc (and no, there... [Fleet]
- ...s most dynamic AI companies, like... ...build the platform engineers turn to to ship... ...Infrastructure Ops Engineer at... ...that power our fleet. This role is designed... ...high-level capacity strategies are... ...hands-on with Kubernetes and cloud-... ...between SRE and Infra teams, executing... [Fleet, Work experience placement, Work at office, Flexible hours]
$180k - $250k
A tech innovation company is looking for a hands-on engineer in San Francisco to manage a vast fleet of GPU servers. You will build systems for tracking server lifecycle, automate provisioning and health checks, and ensure OS-level security. The role requires 5+ years... [Fleet]
$320k - $405k
...interpretable, and steerable AI systems. We want... ...researchers, engineers, policy experts,... ...Engineer, Node Infra About the role the... ...lifecycle of accelerator capacity at the company. We... ...node in the fleet usable and ready to... ...platforms (e.g., Kubernetes, IaC, AWS/GCP/Azure... [Fleet, Visa sponsorship]
$127k - $249k
...The Team Platform Engineering is the department within... ...our multi-cloud-provider Kubernetes infrastructure,... ...alerting systems. The Fleet Management team provides... ...processes ("allergic to ops work") We are a small... ...data platform for the AI era, enabling builders... [Fleet, Work at office, Local area, Remote work, Worldwide, Flexible hours]
$293k
...the architectural and engineering backbone of OpenAI’s infrastructure... ...of cutting-edge AI models. Our work spans... ...architecture, fleet-level monitoring, and... ..., joined to the right Kubernetes control plane, registered... ...turning new SKUs into capacity that is usable by... [Fleet, Full time, Work at office, Local area, Relocation package, Flexible hours]
- ...the pioneering Causal AI platform. We help the world... ...we're looking for an engineer who wants to own the... ...reliability across the fleet. Improve system and network... ..., and proactive capacity planning. Implement and... ...peering. Familiarity with Kubernetes networking (CNI plugins... [Fleet]
$140.6k - $173.1k
...seasoned Staff Software Engineer in the North America... ...that focuses on building AI Platform to support the... ...Platform. Within this capacity, you will be responsible... ...strategic credit issuance to fleet organizations and their... ...such as Docker and Kubernetes) Awareness of API... [Fleet, Remote work, Flexible hours]
- Senior Site Reliability Engineer - AI Infrastructure Location: Global Remote /... ...degradation, NCCL timeouts). Own capacity planning across heterogeneous GPU fleets optimized for training... ...the syscall and hardware level. Kubernetes & Orchestration: Strong experience... [Fleet, Full time, Remote work]
- ...Site Reliability Engineer to ensure the reliability... ...scalability of our AI infrastructure... ...performance tuning, incident ops, infrastructure... ...the founders, the infra team, and the dev... ..., debugging, capacity planning, and failure... ...~ Experience with Kubernetes or similar orchestrators...
- ...that increase the speed and capacity of buildout, starting with data... ...foundational members of our Engineering team, you will design components... ...for our fast-growing fleet of production robots. We like... ...with Hardware Engineers and Ops team to support overall product... [Fleet]
- ...America's manufacturing base. Our AI-powered robots automate food prep... ...is looking for a Senior Software Engineer, Robotics Platform, to help us scale our fleet of robots. You will make a large... ...people in a tech lead or similar capacity. Chef Robotics is solving one of... [Fleet]
- A leading AI research company in San Francisco is seeking engineers to operate next-gen compute clusters. The role requires scaling Kubernetes, automating infrastructure, and ensuring system reliability. Ideal candidates have strong Kubernetes and scripting skills with...
$200k - $260k
...Inc. is seeking a founding infrastructure engineer based in San Francisco to rebuild their... ...core systems from scratch using AWS and Kubernetes. The ideal candidate will have extensive... ...revolutionizing the physical goods supply chain with AI-native solutions... [Full time]
- ...the world's most dynamic AI companies, like Cursor,... ...build the platform engineers turn to ship AI products... ...THE ROLE We’re hiring a Capacity and Infrastructure Analytics... ...across Baseten’s fleet. You’ll create reliable... ...Accounting, Product, and Ops stakeholders. Strong... [Fleet, Flexible hours]
- ...the world's most dynamic AI companies, like Cursor,... ...build the platform engineers turn to to ship AI products... ...how Baseten operates. Capacity helps unlock revenue by... ...Partner with SRE and Infra teams to ensure Capacity... ...operational reality of the fleet, not just the desired... [Fleet, Flexible hours]
- AI Systems Engineer - Codex Core Agents The Codex Core Agents team builds the agent harness that turns... ...envelope around tokens, latency, cost, capacity, and quality. The harness is open... ...behavior, inference/runtime stack, GPU fleet, and product surface. You’ll work with... [Fleet]
$200k - $260k
...from zero — AWS, Kubernetes, Golang, and... ...that the rest of engineering will build on... ...years. Founding infra seat with the architectural... ...building the AI-native... ....ts service fleet (API gateway, GraphQL... ...Reliability & ops — Backups, DR,... ...for tomorrow. Capacity planning. Cost... [Fleet, Full time, Work at office, Local area]
- Generalist is seeking a candidate to manage GPU fleets for training large-scale AI models. You will optimize ML data loading, storage, and orchestration... ...candidates have deep experience with GPUs, Slurm or Kubernetes, and a strong understanding of the ML hardware stack.... [Fleet]
- ...Hyphen Connect in San Francisco is seeking a Robotics Software Engineer to design advanced algorithms and systems for optimizing robotic fleet efficiency. Your expertise will be critical in developing low-latency communication protocols and cloud dashboards for fleet monitoring... [Fleet]
$202.5k - $247.5k
...sharing localhost or running AI workloads in production... ..., AI inference, device fleets, and site‑to‑site... ...your time. About the Infra Platform Team The Infra... ...builds the systems ngrok engineers rely on to build,... ...Go, PostgreSQL, gRPC, Kubernetes, Terraform, Protobuf, nix... [Fleet, Permanent employment, Full time, Work at office, Local area, Remote work, Home office, Flexible hours]
- ...build the foundation for agent engineering in the real world, helping... ...prototypes to production‑ready AI agents that teams can rely on... ..., Evaluation, Deployment, Fleet, and Sandboxes), our open‑source... ...), containers, and basic Kubernetes concepts Have shipped and operated... [Fleet, Work at office, Flexible hours]
- ...individual to build a high-performance macOS virtualization platform. This role involves managing the VM lifecycle and integrating with the fleet scheduler to optimize performance on Apple Silicon. The ideal candidate is curious and has hands-on experience in virtualization... [Fleet, Flexible hours]
- AI Systems Engineer - Codex Core Agents Location San Francisco Employment Type Full time Department Applied AI Compensation 230K-385... ...agent stack, from backend systems to inference, GPUs, and fleet capacity. Work closely with research to make the harness trainable... [Fleet, Full time, Work at office, Local area, Relocation package, Flexible hours]
- ...are a non-hierarchical team seeking a highly experienced Dev-ops Engineer to collaborate with the technical team who specializes in blockchain... ..., utilizing containerisation technologies (e.g. Docker, Kubernetes) to automate the deployment process, ensuring seamless and... [Remote work, Flexible hours]
$125k - $195k
...exceptional, hands-on engineers to make this happen. Mechanical... ...philosophy towards infra is minimal,... ...docker, cloud services, or kubernetes. Instead, there is a lot... ...deploy and manage our fleet of on-prem servers,... ...upon the applicant’s capacity to serve in compliance... [Fleet, Work at office, Visa sponsorship, Night shift]
$255k - $405k
...Join the engineering teams that bring OpenAI's ideas... ...the benefits of AI, while ensuring that... ...metrics across our fleet. We're now layering... ...engineers across the stack-infra, backend, and... ...researchers, user ops, and other teams... ..., and cloud infra (Kubernetes, AWS, etc). Bonus... [Fleet]
- ...world At Bedrock, we’re moving AI out of the lab and into the... ...construction veterans and world-class engineers to solve physical-world... ...hardware development and robotic fleet operations. This role sits at... ...Meeting with field ops on triage trends Reviewing top... [Fleet, Permanent employment, Contract work, Work at office, Flexible hours, Shift work, Night shift, Day shift]