Infrastructure / Cluster Engineer

Gimlet Labs

About Us

Gimlet is building the next generation of AI infrastructure: large-scale AI datacenters and the orchestration platform that coordinates them.

The future of AI will require vastly more compute than exists today. But as AI workloads become more complex and new hardware architectures emerge, simply deploying more GPUs isn't enough. The challenge is making increasingly diverse compute work together.

Gimlet's platform intelligently partitions and routes workloads across heterogeneous hardware, enabling step-function improvements in performance and efficiency. Customers deploy through production-grade APIs without needing to think about hardware selection, placement, or optimization.

We work with foundation labs, hyperscalers, and AI-native companies to power production workloads at massive scale and help define the infrastructure layer for the future of AI.

About this Role

We are looking for an Infrastructure / Cluster Engineer to design, build, and operate the cluster infrastructure behind Gimlet's heterogeneous inference cloud. Unlike traditional cloud platforms built around a single hardware ecosystem, Gimlet's infrastructure spans multiple accelerator vendors and architectures. Infrastructure engineers play a key role in bringing new hardware platforms online, building the operational abstractions that make heterogeneous infrastructure manageable at scale, and ensuring new silicon can serve production workloads reliably from day one.

This role is highly hands-on. You will work across bare metal, Linux, Kubernetes or cluster schedulers, high-speed networking, observability, provisioning, and incident response. You will partner closely with distributed systems, runtime, compiler, and hardware teams to ensure Gimlet's infrastructure can support demanding AI workloads at production scale.

What you will work on

Design, deploy, and operate large-scale CPU, GPU, and accelerator clusters powering production AI inference.
Build automation for provisioning, configuration, upgrades, validation, and lifecycle management.
Design and scale provisioning systems for heterogeneous bare-metal infrastructure across multiple datacenters and hardware vendors.
Operate cluster scheduling, resource allocation, isolation, quotas, and utilization systems.
Debug complex production issues across Linux, networking, storage, drivers, firmware, and orchestration layers.
Build and operate high-performance networking infrastructure, including RDMA-enabled environments and accelerator interconnects.
Build observability for cluster health, capacity, performance, failures, and workload behavior.
Improve reliability, availability, and recovery across multi-node production systems.
Work with distributed systems and runtime teams to support low-latency, high-throughput inference workloads.
Evaluate and integrate new hardware platforms, accelerators, networking technologies, and datacenter designs.
Create runbooks, operational standards, and incident response practices as the fleet scales.

You may be a good fit if you have

Experience in infrastructure, cluster engineering, platform engineering, SRE, HPC, or distributed systems.
Deep Linux systems experience, including debugging performance, networking, storage, processes, and kernel-level issues.
Experience operating Kubernetes, Slurm, Nomad, or similar orchestration and scheduling systems.
Strong automation skills using tools such as Terraform, Ansible, Helm, Python, Go, or equivalent.
Experience with GPU or accelerator infrastructure, including drivers, firmware, CUDA/ROCm stacks, or hardware validation.
Familiarity with high-performance networking such as InfiniBand, RoCE, high-speed Ethernet, or datacenter fabrics.
Strong operational judgment: you know how to build systems that are observable, recoverable, and boring in production.
Comfort working in a fast-moving startup environment with high ownership and ambiguity.

Strong candidates may also have

Experience building or operating AI inference, training, HPC, or neocloud infrastructure.
Experience with bare-metal provisioning, PXE/iPXE, image pipelines, BIOS/firmware management, or rack bring-up.
Experience with multi-tenant cluster isolation, quota systems, fair scheduling, or usage accounting.
Experience debugging distributed workload performance across compute, memory, network, and storage bottlenecks.
Experience building observability platforms using technologies such as Prometheus, OpenTelemetry, Grafana, or similar tooling.
Familiarity with heterogeneous hardware environments across NVIDIA, AMD, Intel, ARM, or emerging accelerators.

Vacancy posted 2 days ago
