Senior Hardware Health Engineer - Scale GPU Clusters

United States Digital Space LLC

United States Digital Space LLC seeks an experienced Engineer for the Hardware Health and Observability team, responsible for maintaining and optimizing the health of our global compute fleet. You will define health signals and build automated remediation systems across millions of GPUs and CPUs. The ideal candidate boasts over 7 years in software or infrastructure engineering, with a strong command of Python and experience in large-scale systems. We prioritize continuous availability for our research and product teams.

Vacancy posted 1 day ago
Similar jobs that could be interesting for you
Based on the Senior Hardware Health Engineer - Scale GPU Clusters in San Francisco, CA vacancy
  •  ...infrastructure firm based in California is looking for an experienced GPU Infrastructure Manager to join their team. The role involves architecting and deploying GPU clusters globally, alongside mentoring junior engineers. Candidates should have 10+ years of experience with GPU... 
    Senior

    The San Francisco Compute Company

    San Francisco, CA
    1 day ago
  •  ...Supercomputing to design, build, and operate a GPU supercomputing environment. You will enable fast, large-scale research by ensuring high-performance computing...  ...candidate has a strong background in operating GPU clusters, container orchestration, and deep learning... 
    Senior

    Radical Numerics Inc.

    San Francisco, CA
    4 days ago
  • $250k

    Hamilton Barnes Associates Limited in San Francisco is seeking an experienced engineer to design and maintain large-scale GPU clusters for training and inference. The candidate should have over 7 years in SRE or DevOps, with strong skills in Kubernetes and Linux systems... 
    Senior

    Hamilton Barnes Associates Limited

    San Francisco, CA
    4 days ago
  • A leading AI infrastructure company is looking for a Senior Site Reliability Engineer to design and operate large-scale GPU clusters. In this role, you will work closely with clients to troubleshoot and optimize AI infrastructure. The ideal candidate has extensive experience... 
    Senior

    Andromeda

    San Francisco, CA
    1 day ago
  • $300k

    A stealth-mode startup in San Francisco seeks a Platform Engineer/Senior Site Reliability Engineer to manage their AI and cloud platform. You will design and maintain large-scale GPU clusters, create automation pipelines, and enhance system reliability. Ideal candidates... 
    Senior

    Hamilton Barnes Associates Limited

    San Francisco, CA
    2 days ago
  •  ...history. When people finance GPU clusters, the datacenters housing...  ...the market? Otherwise, as AI scales, compute only becomes available...  ...culture, mentor junior engineers, and learn from our customers...  ...You deeply understand server hardware fundamentals, including GPUs... 
    Long term contract
    Contract work
    Fixed term contract
    Work at office
    Local area
    Visa sponsorship
    Shift work
    3 days per week

    The San Francisco Compute Company

    San Francisco, CA
    1 day ago
  • $190k - $270k

    AI Chopping Block, Inc. is seeking an AI Infrastructure Engineer in San Francisco. This role requires maintaining user-facing services...  ...Competitive compensation includes a salary range of $190,000 - $270,000, equity, and health benefits.
    Senior

    AI Chopping Block, Inc.

    San Francisco, CA
    1 day ago
  • Cortes 23 in San Francisco is seeking a Senior Site Reliability Engineer to design and operate large-scale GPU infrastructure. This high-impact role requires deep expertise in distributed systems and a proactive approach to incident management. The successful candidate... 
    Senior
    Remote job

    Cortes 23

    San Francisco, CA
    21 hours ago
  • $160k - $225k

    Cacheflow is seeking a Senior Software Engineer for AI Runtime at Databricks, located in San Francisco. You will be instrumental in building and scaling systems for large-scale GPU training, ensuring high throughput and resilience in training across expansive fleets of... 
    Senior

    Cacheflow

    San Francisco, CA
    21 hours ago
  • Baseten is hiring a Network Engineer (Data Centers) in San Francisco to design and own the high-performance network infrastructure for their GPU clusters. This senior role collaborates closely with hardware and platform teams, directly impacting model performance and inference... 
    Senior
    Flexible hours

    Baseten

    San Francisco, CA
    8 days ago
  • $250k

     ...opportunities? Join a rapidly scaling AI cloud infrastructure...  ...a next-generation GPU platform designed for AI training...  ...company is looking for a Senior / Staff Site Reliability Engineer to support and scale large...  ...for GPU compute clusters Collaborate with ML, data... 
    Senior
    Permanent employment
    Remote work
    San Francisco, CA
    a month ago
  • $238k - $288k

     ...urgency, who believe in the scale of our ambition and...  ...expands across new GPU and CPU server platforms...  ...we're hiring a founding engineer to lead our BMC firmware...  ...teams from schematics and hardware design docs. Own the...  ...Comprehensive health, dental & vision insurance... 
    Senior
    Temporary work

    Crusoe Energy Systems LLC

    San Francisco, CA
    4 days ago
  • A cutting-edge AI video platform is seeking a Senior Software Engineer (Infrastructure) to manage its GPU deployments and maintain a reliable AWS backbone. You will collaborate with specialized providers to ensure high availability and architect scalable systems, impacting... 
    Senior

    Jack & Jill/External ATS

    San Francisco, CA
    3 days ago
  •  ...AI research company in San Francisco is seeking a Software Engineer for the Fleet Hardware team. This role focuses on ensuring the reliability and...  ...monitoring tools. Ideal candidates have experience with large-scale server environments and proficiency in languages like... 

    OpenAI

    San Francisco, CA
    4 days ago
  • Slope is seeking an experienced engineer to maintain system integrity for its supercomputers...  ...include owning system health checks, diagnosing hardware failures, and developing automation...  ...to minimize disruptions during large-scale operations. This role is crucial for... 

    Slope

    San Francisco, CA
    4 days ago
  •  ...large distributed ML training and inference clusters Develop efficient, scalable end-to-end pipelines to manage petabyte-scale datasets and model training throughout the entire...  ...scales Analyze, profile and debug low-level GPU operations to optimize performance Stay up-... 
    Senior

    Kindredventures

    San Francisco, CA
    21 hours ago
  •  ...pipelines. Algorithmic Innovation: Refine and scale algorithms for our VPS system including...  ...for our map of the world. Performance Engineering: Optimize complex ML and CV code for maximum...  ...and high-performance execution on GPU/CPU. Benchmarking & Evaluation: Create and... 
    Senior
    Work at office
    3 days per week

    Niantic Spatial

    San Francisco, CA
    4 days ago
  • $170k - $245k

     ...startup agility with HP’s global scale, we’re building intelligent...  ...a diverse, world‑class team—engineers, designers, researchers, and...  ...a multidisciplinary group of hardware, embedded software, and AI engineers...  ...package, including: Health insurance Dental insurance... 
    Senior
    Full time
    Temporary work
    Local area
    Flexible hours

    HP IQ

    San Francisco, CA
    2 days ago
  •  ...Algorithmic Innovation: Refine and scale algorithms for our...  ...map of the world. Performance Engineering: Optimize complex Computer Vision...  ...high-performance execution on GPU/CPU. Benchmarking & Evaluation...  ...disclose to Niantic Spatial, such as health or medical information, race... 
    Senior
    Work at office
    3 days per week

    Niantic Spatial, Inc.

    San Francisco, CA
    4 days ago
  • $248k - $310k

    United States Digital Space LLC in San Francisco is seeking a Staff Engineer for their Safety Experience team. The ideal candidate will lead the development of safety products, working on complex projects that ensure compliance and protect users. Candidates should have... 
    Senior
    Relocation package

    United States Digital Space LLC

    San Francisco, CA
    1 day ago
  • $196k - $220.5k

    Ultimate.ai is seeking a Product Manager to lead efforts in combating scaled abuse on Discord. You'll create strategies and coordinate with cross-functional teams while building user-facing features. The ideal candidate has strong product management experience and a passion... 
    Senior

    Ultimate.ai

    San Francisco, CA
    3 days ago
  • $204k - $240k

    About the role Samsara's Hardware Reliability team enables an exceptional customer...  ...to resolve key issues. As a Senior Hardware Reliability Engineer, you will design quality processes that...  ...Be Inclusive, Win as a Team) as we scale globally and across new offices. Minimum... 
    Senior

    Antler

    San Francisco, CA
    3 days ago
  • $133.58k - $224.5k

     ...transform their operations at scale. Working at Samsara means...  ...About the role: Samsara’s Hardware Reliability team enables...  ...key issues. Samsara’s Senior Hardware Reliability Engineer will design quality processes...  ...remote and flexible working, health benefits, and much, much... 
    Senior
    Full time
    Work at office
    Remote work
    Flexible hours

    Samsara

    San Francisco, CA
    4 days ago
  • A tech startup in AI is seeking a Senior Infrastructure Engineer in San Francisco, CA. This role involves building and scaling a GPU Cloud Marketplace, transforming raw GPUs into...  ...orchestration and effective collaboration with hardware vendors. Strong skills in Terraform,... 
    Senior

    Hyperbolic Labs

    San Francisco, CA
    3 days ago
  •  ...technology infrastructure company in San Francisco is seeking an experienced engineer to manage and operate GPU clusters. The role requires over 5 years of hands-on experience, a deep understanding of hardware systems, and a passion for automating fleet operations. You will... 
    Senior

    The San Francisco Compute Company

    San Francisco, CA
    2 days ago
  • A cutting-edge AI technology company based in San Francisco is seeking a specialist to design and operate large-scale GPU infrastructure. This role requires expertise in deploying GPU systems for high-throughput inference and model performance optimization. The ideal candidate... 
    Senior

    Reflection AI

    San Francisco, CA
    3 days ago
  •  ...technology company in San Francisco is looking for a Senior Software Engineer to build scalable infrastructure for large‑scale training and fine-tuning of foundation models....  ...distributed training systems and optimize GPU utilization while collaborating with cross-functional... 
    Senior

    Baseten

    San Francisco, CA
    3 days ago
  • $200k - $400k

    Inferact is seeking a dedicated cluster administration engineer to manage high-performance GPU compute infrastructure in San Francisco. This hands-on role focuses on optimizing system health and availability for engineering productivity. Ideal candidates will have substantial... 
    Senior
    Remote job

    Inferact

    San Francisco, CA
    2 days ago
  • Hamilton Barnes Associates Limited is seeking a Senior ML Infrastructure Engineer to help build and scale Kubernetes-based machine learning platforms. This role focuses on workload orchestration, GPU scheduling, and ensuring system reliability, working with highly technical... 
    Senior

    Hamilton Barnes Associates Limited

    San Francisco, CA
    1 day ago
  • Hamilton Barnes Associates Limited is looking for a Senior Storage Engineer to support large-scale AI infrastructure in San Francisco. This role involves designing...  ...scalable storage solutions for high-performance GPU platforms. The ideal candidate has extensive... 
    Senior
    Remote job

    Hamilton Barnes Associates Limited

    San Francisco, CA
    3 days ago
