Infrastructure / Cluster Engineer

Gimlet Labs

Gimlet is building the next generation of AI infrastructure: large-scale AI datacenters and the orchestration platform that coordinates them.

The future of AI will require vastly more compute than exists today. But as AI workloads become more complex and new hardware architectures emerge, simply deploying more GPUs isn't enough. The challenge is making increasingly diverse compute work together.

Gimlet's platform intelligently partitions and routes workloads across heterogeneous hardware, enabling step-function improvements in performance and efficiency. Customers deploy through production-grade APIs without needing to think about hardware selection, placement, or optimization.

We work with foundation labs, hyperscalers, and AI-native companies to power production workloads at massive scale and help define the infrastructure layer for the future of AI.

About this Role

We are looking for an Infrastructure / Cluster Engineer to design, build, and operate the cluster infrastructure behind Gimlet's heterogeneous inference cloud. Unlike traditional cloud platforms built around a single hardware ecosystem, Gimlet's infrastructure spans multiple accelerator vendors and architectures. Infrastructure engineers play a key role in bringing new hardware platforms online, building the operational abstractions that make heterogeneous infrastructure manageable at scale, and ensuring new silicon can serve production workloads reliably from day one.

This role is highly hands-on. You will work across bare metal, Linux, Kubernetes or cluster schedulers, high-speed networking, observability, provisioning, and incident response. You will partner closely with distributed systems, runtime, compiler, and hardware teams to ensure Gimlet's infrastructure can support demanding AI workloads at production scale.

What You Will Work On

Design, deploy, and operate large-scale CPU, GPU, and accelerator clusters powering production AI inference.
Build automation for provisioning, configuration, upgrades, validation, and lifecycle management.
Design and scale provisioning systems for heterogeneous bare-metal infrastructure across multiple datacenters and hardware vendors.
Operate cluster scheduling, resource allocation, isolation, quotas, and utilization systems.
Debug complex production issues across Linux, networking, storage, drivers, firmware, and orchestration layers.
Build and operate high-performance networking infrastructure, including RDMA-enabled environments and accelerator interconnects.
Build observability for cluster health, capacity, performance, failures, and workload behavior.
Improve reliability, availability, and recovery across multi-node production systems.
Work with distributed systems and runtime teams to support low-latency, high-throughput inference workloads.
Evaluate and integrate new hardware platforms, accelerators, networking technologies, and datacenter designs.
Create runbooks, operational standards, and incident response practices as the fleet scales.

You May Be A Good Fit If You Have

Experience in infrastructure, cluster engineering, platform engineering, SRE, HPC, or distributed systems.
Deep Linux systems experience, including debugging performance, networking, storage, processes, and kernel-level issues.
Experience operating Kubernetes, Slurm, Nomad, or similar orchestration and scheduling systems.
Strong automation skills using tools such as Terraform, Ansible, Helm, Python, Go, or equivalent.
Experience with GPU or accelerator infrastructure, including drivers, firmware, CUDA/ROCm stacks, or hardware validation.
Familiarity with high-performance networking such as InfiniBand, RoCE, high-speed Ethernet, or datacenter fabrics.
Strong operational judgment: you know how to build systems that are observable, recoverable, and boring in production.
Comfort working in a fast-moving startup environment with high ownership and ambiguity.

Strong Candidates May Also Have

Experience building or operating AI inference, training, HPC, or neocloud infrastructure.
Experience with bare-metal provisioning, PXE/iPXE, image pipelines, BIOS/firmware management, or rack bring-up.
Experience with multi-tenant cluster isolation, quota systems, fair scheduling, or usage accounting.
Experience debugging distributed workload performance across compute, memory, network, and storage bottlenecks.
Experience building observability platforms using technologies such as Prometheus, OpenTelemetry, Grafana, or similar tooling.
Familiarity with heterogeneous hardware environments across NVIDIA, AMD, Intel, ARM, or emerging accelerators.

Vacancy posted 1 day ago
