Infrastructure / Cluster Engineer

Gimlet Labs

Gimlet is building the next generation of AI infrastructure: large-scale AI datacenters and the orchestration platform that coordinates them.

The future of AI will require vastly more compute than exists today. But as AI workloads become more complex and new hardware architectures emerge, simply deploying more GPUs isn't enough. The challenge is making increasingly diverse compute work together.

Gimlet's platform intelligently partitions and routes workloads across heterogeneous hardware, enabling step-function improvements in performance and efficiency. Customers deploy through production-grade APIs without needing to think about hardware selection, placement, or optimization.

We work with foundation labs, hyperscalers, and AI-native companies to power production workloads at massive scale and help define the infrastructure layer for the future of AI.

About this Role

We are looking for an Infrastructure / Cluster Engineer to design, build, and operate the cluster infrastructure behind Gimlet's heterogeneous inference cloud. Unlike traditional cloud platforms built around a single hardware ecosystem, Gimlet's infrastructure spans multiple accelerator vendors and architectures. Infrastructure engineers play a key role in bringing new hardware platforms online, building the operational abstractions that make heterogeneous infrastructure manageable at scale, and ensuring new silicon can serve production workloads reliably from day one.

This role is highly hands-on. You will work across bare metal, Linux, Kubernetes or cluster schedulers, high-speed networking, observability, provisioning, and incident response. You will partner closely with distributed systems, runtime, compiler, and hardware teams to ensure Gimlet's infrastructure can support demanding AI workloads at production scale.

What You Will Work On
  • Design, deploy, and operate large-scale CPU, GPU, and accelerator clusters powering production AI inference.
  • Build automation for provisioning, configuration, upgrades, validation, and lifecycle management.
  • Design and scale provisioning systems for heterogeneous bare-metal infrastructure across multiple datacenters and hardware vendors.
  • Operate cluster scheduling, resource allocation, isolation, quotas, and utilization systems.
  • Debug complex production issues across Linux, networking, storage, drivers, firmware, and orchestration layers.
  • Build and operate high-performance networking infrastructure, including RDMA-enabled environments and accelerator interconnects.
  • Build observability for cluster health, capacity, performance, failures, and workload behavior.
  • Improve reliability, availability, and recovery across multi-node production systems.
  • Work with distributed systems and runtime teams to support low-latency, high-throughput inference workloads.
  • Evaluate and integrate new hardware platforms, accelerators, networking technologies, and datacenter designs.
  • Create runbooks, operational standards, and incident response practices as the fleet scales.
You May Be A Good Fit If You Have
  • Experience in infrastructure, cluster engineering, platform engineering, SRE, HPC, or distributed systems.
  • Deep Linux systems experience, including debugging performance, networking, storage, processes, and kernel-level issues.
  • Experience operating Kubernetes, Slurm, Nomad, or similar orchestration and scheduling systems.
  • Strong automation skills using tools such as Terraform, Ansible, Helm, Python, Go, or equivalent.
  • Experience with GPU or accelerator infrastructure, including drivers, firmware, CUDA/ROCm stacks, or hardware validation.
  • Familiarity with high-performance networking such as InfiniBand, RoCE, high-speed Ethernet, or datacenter fabrics.
  • Strong operational judgment: you know how to build systems that are observable, recoverable, and boring in production.
  • Comfort working in a fast-moving startup environment with high ownership and ambiguity.
Strong Candidates May Also Have
  • Experience building or operating AI inference, training, HPC, or neocloud infrastructure.
  • Experience with bare-metal provisioning, PXE/iPXE, image pipelines, BIOS/firmware management, or rack bring-up.
  • Experience with multi-tenant cluster isolation, quota systems, fair scheduling, or usage accounting.
  • Experience debugging distributed workload performance across compute, memory, network, and storage bottlenecks.
  • Experience building observability platforms using technologies such as Prometheus, OpenTelemetry, Grafana, or similar tooling.
  • Familiarity with heterogeneous hardware environments across NVIDIA, AMD, Intel, ARM, or emerging accelerators.
Vacancy posted 1 day ago
San Francisco, CA