Infrastructure / Cluster Engineer
Gimlet Labs
About Us
Gimlet is building the next generation of AI infrastructure: large-scale AI datacenters and the orchestration platform that coordinates them. The future of AI will require vastly more compute than exists today. But as AI workloads become more complex and new hardware architectures emerge, simply deploying more GPUs isn't enough. The challenge is making increasingly diverse compute work together. Gimlet's platform intelligently partitions and routes workloads across heterogeneous hardware, enabling step-function improvements in performance and efficiency. Customers deploy through production-grade APIs without needing to think about hardware selection, placement, or optimization. We work with foundation labs, hyperscalers, and AI-native companies to power production workloads at massive scale and help define the infrastructure layer for the future of AI.
About this Role
We are looking for an Infrastructure / Cluster Engineer to design, build, and operate the cluster infrastructure behind Gimlet's heterogeneous inference cloud. Unlike traditional cloud platforms built around a single hardware ecosystem, Gimlet's infrastructure spans multiple accelerator vendors and architectures. Infrastructure engineers play a key role in bringing new hardware platforms online, building the operational abstractions that make heterogeneous infrastructure manageable at scale, and ensuring new silicon can serve production workloads reliably from day one.
This role is highly hands-on. You will work across bare metal, Linux, Kubernetes or cluster schedulers, high-speed networking, observability, provisioning, and incident response. You will partner closely with distributed systems, runtime, compiler, and hardware teams to ensure Gimlet's infrastructure can support demanding AI workloads at production scale.
What you will work on
- Design, deploy, and operate large-scale CPU, GPU, and accelerator clusters powering production AI inference.
- Build automation for provisioning, configuration, upgrades, validation, and lifecycle management.
- Design and scale provisioning systems for heterogeneous bare-metal infrastructure across multiple datacenters and hardware vendors.
- Operate cluster scheduling, resource allocation, isolation, quotas, and utilization systems.
- Debug complex production issues across Linux, networking, storage, drivers, firmware, and orchestration layers.
- Build and operate high-performance networking infrastructure, including RDMA-enabled environments and accelerator interconnects.
- Build observability for cluster health, capacity, performance, failures, and workload behavior.
- Improve reliability, availability, and recovery across multi-node production systems.
- Work with distributed systems and runtime teams to support low-latency, high-throughput inference workloads.
- Evaluate and integrate new hardware platforms, accelerators, networking technologies, and datacenter designs.
- Create runbooks, operational standards, and incident response practices as the fleet scales.
- Experience in infrastructure, cluster engineering, platform engineering, SRE, HPC, or distributed systems.
- Deep Linux systems experience, including debugging performance, networking, storage, processes, and kernel-level issues.
- Experience operating Kubernetes, Slurm, Nomad, or similar orchestration and scheduling systems.
- Strong automation skills using tools such as Terraform, Ansible, Helm, Python, Go, or equivalent.
- Experience with GPU or accelerator infrastructure, including drivers, firmware, CUDA/ROCm stacks, or hardware validation.
- Familiarity with high-performance networking such as InfiniBand, RoCE, high-speed Ethernet, or datacenter fabrics.
- Strong operational judgment: you know how to build systems that are observable, recoverable, and boring in production.
- Comfort working in a fast-moving startup environment with high ownership and ambiguity.
- Experience building or operating AI inference, training, HPC, or neocloud infrastructure.
- Experience with bare-metal provisioning, PXE/iPXE, image pipelines, BIOS/firmware management, or rack bring-up.
- Experience with multi-tenant cluster isolation, quota systems, fair scheduling, or usage accounting.
- Experience debugging distributed workload performance across compute, memory, network, and storage bottlenecks.
- Experience building observability platforms using technologies such as Prometheus, OpenTelemetry, Grafana, or similar tooling.
- Familiarity with heterogeneous hardware environments across NVIDIA, AMD, Intel, ARM, or emerging accelerators.
Vacancy posted 2 days ago

