Senior Hardware Health Engineer - Scale GPU Clusters
United States Digital Space LLC
United States Digital Space LLC seeks an experienced engineer for the Hardware Health and Observability team, responsible for maintaining and optimizing the health of our global compute fleet. You will define health signals and build automated remediation systems across millions of GPUs and CPUs. The ideal candidate has over 7 years of experience in software or infrastructure engineering, with a strong command of Python and experience in large-scale systems. We prioritize continuous availability for our research and product teams.
- ...infrastructure firm based in California is looking for an experienced GPU Infrastructure Manager to join their team. The role involves architecting and deploying GPU clusters globally, alongside mentoring junior engineers. Candidates should have 10+ years of experience with GPU... Senior
- ...Supercomputing to design, build, and operate a GPU supercomputing environment. You will enable fast, large-scale research by ensuring high-performance computing... ...candidate has a strong background in operating GPU clusters, container orchestration, and deep learning... Senior
$250k
Hamilton Barnes Associates Limited in San Francisco is seeking an experienced engineer to design and maintain large-scale GPU clusters for training and inference. The candidate should have over 7 years in SRE or DevOps, with strong skills in Kubernetes and Linux systems... Senior
- A leading AI infrastructure company is looking for a Senior Site Reliability Engineer to design and operate large-scale GPU clusters. In this role, you will work closely with clients to troubleshoot and optimize AI infrastructure. The ideal candidate has extensive experience... Senior
$300k
A stealth-mode startup in San Francisco seeks a Platform Engineer/Senior Site Reliability Engineer to manage their AI and cloud platform. You will design and maintain large-scale GPU clusters, create automation pipelines, and enhance system reliability. Ideal candidates... Senior
- ...history. When people finance GPU clusters, the datacenters housing... ...the market? Otherwise, as AI scales, compute only becomes available... ...culture, mentor junior engineers, and learn from our customers... ...You deeply understand server hardware fundamentals, including GPUs... Long term contract, Contract work, Fixed term contract, Work at office, Local area, Visa sponsorship, Shift work, 3 days per week
$190k - $270k
AI Chopping Block, Inc. is seeking an AI Infrastructure Engineer in San Francisco. This role requires maintaining user-facing services... ...Competitive compensation includes a salary range of $190,000 - $270,000, equity, and health benefits. Senior
- Cortes 23 in San Francisco is seeking a Senior Site Reliability Engineer to design and operate large-scale GPU infrastructure. This high-impact role requires deep expertise in distributed systems and a proactive approach to incident management. The successful candidate... Senior, Remote job
$160k - $225k
Cacheflow is seeking a Senior Software Engineer for AI Runtime at Databricks, located in San Francisco. You will be instrumental in building and scaling systems for large-scale GPU training, ensuring high throughput and resilience in training across expansive fleets of... Senior
- Baseten is hiring a Network Engineer (Data Centers) in San Francisco to design and own the high-performance network infrastructure for their GPU clusters. This senior role collaborates closely with hardware and platform teams, directly impacting model performance and inference... Senior, Flexible hours
$250k
...opportunities? Join a rapidly scaling AI cloud infrastructure... ...a next-generation GPU platform designed for AI training... ...company is looking for a Senior / Staff Site Reliability Engineer to support and scale large... ...for GPU compute clusters Collaborate with ML, data... Senior, Permanent employment, Remote work
$238k - $288k
...urgency, who believe in the scale of our ambition and... ...expands across new GPU and CPU server platforms... ...we're hiring a founding engineer to lead our BMC firmware... ...teams from schematics and hardware design docs. Own the... ...Comprehensive health, dental & vision insurance... Senior, Temporary work
- A cutting-edge AI video platform is seeking a Senior Software Engineer (Infrastructure) to manage its GPU deployments and maintain a reliable AWS backbone. You will collaborate with specialized providers to ensure high availability and architect scalable systems, impacting... Senior
- ...AI research company in San Francisco is seeking a Software Engineer for the Fleet Hardware team. This role focuses on ensuring the reliability and... ...monitoring tools. Ideal candidates have experience with large-scale server environments and proficiency in languages like...
- Slope is seeking an experienced engineer to maintain system integrity for its supercomputers... ...include owning system health checks, diagnosing hardware failures, and developing automation... ...to minimize disruptions during large-scale operations. This role is crucial for...
- ...large distributed ML training and inference clusters Develop efficient, scalable end-to-end pipelines to manage petabyte-scale datasets and model training throughout the entire... ...scales Analyze, profile and debug low-level GPU operations to optimize performance Stay up-... Senior
- ...pipelines. Algorithmic Innovation: Refine and scale algorithms for our VPS system including... ...for our map of the world. Performance Engineering: Optimize complex ML and CV code for maximum... ...and high-performance execution on GPU/CPU. Benchmarking & Evaluation: Create and... Senior, Work at office, 3 days per week
$170k - $245k
...startup agility with HP’s global scale, we’re building intelligent... ...a diverse, world‑class team: engineers, designers, researchers, and... ...a multidisciplinary group of hardware, embedded software, and AI engineers... ...package, including: Health insurance Dental insurance... Senior, Full time, Temporary work, Local area, Flexible hours
- ...Algorithmic Innovation: Refine and scale algorithms for our... ...map of the world. Performance Engineering: Optimize complex Computer Vision... ...high-performance execution on GPU/CPU. Benchmarking & Evaluation... ...disclose to Niantic Spatial, such as health or medical information, race... Senior, Work at office, 3 days per week
$248k - $310k
United States Digital Space LLC in San Francisco is seeking a Staff Engineer for their Safety Experience team. The ideal candidate will lead the development of safety products, working on complex projects that ensure compliance and protect users. Candidates should have... Senior, Relocation package
$196k - $220.5k
Ultimate.ai is seeking a Product Manager to lead efforts in combating scaled abuse on Discord. You'll create strategies and coordinate with cross-functional teams while building user-facing features. The ideal candidate has strong product management experience and a passion... Senior
$204k - $240k
About the role Samsara's Hardware Reliability team enables an exceptional customer... ...to resolve key issues. As a Senior Hardware Reliability Engineer, you will design quality processes that... ...Be Inclusive, Win as a Team) as we scale globally and across new offices. Minimum... Senior
$133.58k - $224.5k
...transform their operations at scale. Working at Samsara means... ...About the role: Samsara’s Hardware Reliability team enables... ...key issues. Samsara’s Senior Hardware Reliability Engineer will design quality processes... ...remote and flexible working, health benefits, and much, much... Senior, Full time, Work at office, Remote work, Flexible hours
- A tech startup in AI is seeking a Senior Infrastructure Engineer in San Francisco, CA. This role involves building and scaling a GPU Cloud Marketplace, transforming raw GPUs into... ...orchestration and effective collaboration with hardware vendors. Strong skills in Terraform,... Senior
- ...technology infrastructure company in San Francisco is seeking an experienced engineer to manage and operate GPU clusters. The role requires over 5 years of hands-on experience, a deep understanding of hardware systems, and a passion for automating fleet operations. You will... Senior
- A cutting-edge AI technology company based in San Francisco is seeking a specialist to design and operate large-scale GPU infrastructure. This role requires expertise in deploying GPU systems for high-throughput inference and model performance optimization. The ideal candidate... Senior
- ...technology company in San Francisco is looking for a Senior Software Engineer to build scalable infrastructure for large‑scale training and fine-tuning of foundation models.... ...distributed training systems and optimize GPU utilization while collaborating with cross-functional... Senior
$200k - $400k
Inferact is seeking a dedicated cluster administration engineer to manage high-performance GPU compute infrastructure in San Francisco. This hands-on role focuses on optimizing system health and availability for engineering productivity. Ideal candidates will have substantial... Senior, Remote job
- Hamilton Barnes Associates Limited is seeking a Senior ML Infrastructure Engineer to help build and scale Kubernetes-based machine learning platforms. This role focuses on workload orchestration, GPU scheduling, and ensuring system reliability, working with highly technical... Senior
- Hamilton Barnes Associates Limited is looking for a Senior Storage Engineer to support large-scale AI infrastructure in San Francisco. This role involves designing... ...scalable storage solutions for high-performance GPU platforms. The ideal candidate has extensive... Senior, Remote job
