Software Engineer, Hardware Health

$250k

OpenAI

Software Engineer, Hardware Health Frontiers Clusters - San Francisco

About the Team

The Hardware Health and Observability team owns the end-to-end health lifecycle of OpenAI’s global compute fleet. Our mission is to maximize healthy, usable compute across accelerator vendors, generations, cloud providers, and regions through reliable health signals, automated remediation, and scalable operational tooling. We build the systems that observe, detect, remediate, and verify hardware issues across GPUs, CPUs, networking, and platform infrastructure, enabling frontier model training and inference workloads to run reliably at hyperscale. We are the last line of defense for the success of OAI’s production and research workloads.

About the Role

On the Hardware Health and Observability team, you’ll build critical infrastructure that keeps OpenAI’s largest compute clusters healthy and operational at scale. Even small numbers of unhealthy systems can impact large-scale training and inference workloads. This team focuses on minimizing downtime, improving fleet efficiency, and ensuring compute resources remain continuously available to researchers and product teams. Engineers on this team own problems end-to-end, from defining health signals and debugging failures to building automated remediation systems that operate across millions of GPUs globally.

Responsibilities

  • Define and maintain health signals across GPUs, CPUs, networking, and platform infrastructure.
  • Build and evolve health checks that detect, remediate, and verify failures at scale.
  • Ensure critical health checks execute with minimal latency to maximize workload uptime.
  • Investigate hardware failures and system‑level issues across large-scale compute environments.
  • Own node lifecycle workflows including drain, quarantine, repair, RMA, and return‑to‑service processes.
  • Build automation and tooling that enables global cluster management with minimal manual intervention.
  • Partner with workload, reliability, and provider teams to integrate health signals into training and inference systems.

Qualifications

  • 7+ years of industry experience in software or infrastructure engineering.
  • Strong proficiency with Python and shell scripting.
  • Experience building large‑scale distributed systems or infrastructure platforms.
  • Comfort digging into noisy operational data using SQL, PromQL, or similar tooling.
  • Experience building reproducible analyses and operational tooling.
  • Strong systems debugging and operational instincts with an ownership mindset.

Bonus if you have

  • Experience with low‑level hardware systems and Linux tooling (e.g., PCIe, InfiniBand, RoCE, networking, power management, kernel performance tuning, FW/SW debugging).
  • Experience operating or debugging large‑scale GPU or accelerator clusters.
  • Expertise in network operations, observability, or systems telemetry.
  • Experience with automated remediation systems or fleet lifecycle management.
  • Experience improving reliability, utilization, or workload uptime in distributed compute environments.

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general‑purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.

Background checks for applicants will be administered in accordance with applicable law, and qualified applicants with arrest or conviction records will be considered for employment consistent with those laws, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act, for U.S.‑based candidates. For unincorporated Los Angeles County workers, criminal history may be considered in relation to duties such as protecting computer hardware, returning all hardware upon termination, and safeguarding proprietary information.

Compensation

$250K – $445K + equity

Vacancy posted 8 hours ago
Similar jobs that could be interesting for you
Based on the Software Engineer, Hardware Health in San Francisco, CA vacancy
  • $250k

     ...OpenAI is seeking a Software Engineer for Hardware Health in San Francisco. The role involves maintaining the health of compute clusters, building automated systems for monitoring hardware, and ensuring efficient operations across large-scale distributed environments.... 
    Suggested

    OpenAI

    San Francisco, CA
    8 hours ago
  •  ...working systems and build any software needed for running large-...  ...edge AI research. Even a single hardware failure can derail a large-scale...  ...is core to the mission. Engineers here own their work end-to-end...  ...: Own and improve the system health checks that keep our hyperscale... 
    Suggested

    Slope

    San Francisco, CA
    8 hours ago
  • $250k

    About the Team The Hardware Health and Observability team owns the end-to-end health lifecycle...  ...to researchers and product teams. Engineers on this team own problems end-to-end, from...  ...7+ years of industry experience in software or infrastructure engineering. Strong... 
    Suggested

    OpenAI

    San Francisco, CA
    3 days ago
  •  ...deployment over unchecked growth. About the role As a software engineer on the Fleet Hardware team, you will be responsible for the reliability and...  ...and devise innovative solutions to maintain the health and efficiency of our supercomputing infrastructure.... 
    Suggested
    Full time

    OpenAI

    San Francisco, CA
    1 day ago
  • $310k

     ...About the Team OpenAI's Hardware organization develops silicon and system-level solutions...  ...native silicon while working closely with software and research partners to co-design...  ...specifically for AI. About the Role As a software engineer on the Scaling team, you'll help build... 
    Suggested
    Work at office
    Local area
    Relocation package
    3 days per week

    Slope

    San Francisco, CA
    8 hours ago
  • $175k - $195k

     ...Description We’re looking for a Senior Software Engineer to lead the development of systems that...  ...technology you build will empower Fleet Health operators to monitor device performance...  ...role at the intersection of software, hardware, and operations - perfect for engineers... 

    Gridware Technologies Inc.

    San Francisco, CA
    7 hours ago
  • $140k - $170k

     ...quantify fish weights, detect the health status, and generate optimal...  ...at three levels: on-site hardware for image capture, cloud pipelines...  ...looking for a Senior Backend Engineer to build and operate the...  ...WebRTC, FFmpeg, gstreamer Strong software engineering skills; knowledge... 
    Immediate start
    Remote work
    Flexible hours

    Aquabyte

    San Francisco, CA
    8 hours ago
  •  ...technology company building a new type of software platform for health systems. We are on a mission to...  ...We are looking for a Senior Product Engineer who’s excited to tackle some of the hardest...  ...Bridge the gap between software and hardware : Architect a next‑generation... 
    Remote work
    Flexible hours
    3 days per week

    Lunar GMBH

    San Francisco, CA
    1 day ago
  • $160k - $190k

     ...are seeking a full-time Senior Robotics Software Engineer to enhance the performance and...  ...collaborate closely with teams across Hardware, Infrastructure, and Machine Learning to...  ...an equal opportunity employer offering Health, dental, vision, and commuter benefits... 
    Full time
    Immediate start

    King River Capital Group

    San Francisco, CA
    7 hours ago
  •  ...thinking technology company in San Francisco is seeking a Senior Software Engineer to develop the next generation of AI systems. The ideal...  ...working in a fully remote environment. Prior experience in hardware or electronics is not required, as the company values diverse... 
    Remote work

    Jobleads-US

    San Francisco, CA
    a month ago
  • $150k - $215k

     ...Horowitz to Blackrock and Fidelity, and employs a team of 450 engineers and entrepreneurs. Astranis designs, builds, and...  ...ft. headquarters in Northern California, USA. SENIOR SOFTWARE ENGINEER - HARDWARE TEST We are seeking a highly skilled Senior Software Engineer... 
    Permanent employment
    Flexible hours
    Rotating shift

    Astranis

    San Francisco, CA
    3 days ago
  •  ...Flow Engineering Job Flow Engineering is an AI-native requirements...  ...engineering organizations, enabling hardware teams to collaborate with AI...  ...is seeking full stack senior software engineers to build AI-powered...  ...and meaningful equity. Health, dental, and vision coverage.... 
    Flexible hours

    Flow Engineering

    San Francisco, CA
    3 days ago
  •  ...About Flow Flow Engineering is an AI-native requirements platform...  ...We're reimagining how complex hardware is built by pairing world-...  ...is hiring a senior frontend software engineer to own core user experiences...  ...and meaningful equity. Health, dental, and vision coverage.... 
    Flexible hours

    Flow Engineering

    San Francisco, CA
    2 days ago
  •  ...Performs as a key contributor to an engineering team that builds and supports...  ...activities on application software; this may often require...  ...and monitoring of production health. • Produces complete, simple,...  ...impact assessment of product (hardware, software) upgrades • Assists... 

    Procyon TS

    San Francisco, CA
    3 days ago
  • $180k - $250k

     ...You are a hands-on engineer who builds the software and processes that keep a large fleet of GPU servers...  ...of servers including provisioning, health monitoring, error detection, and recovery...  ..., dashboards, and alerting for hardware health across the fleet (GPU errors,... 
    Local area
    Remote work
    Relocation package

    Fal

    San Francisco, CA
    1 day ago
  • $200k - $240k

     ...led by veteran operators and engineers, alumni of Sonos, Paypal, Tesla...  ...We're looking for a Software Engineer, Build Infrastructure...  ...Sentry), and debugging of fleet health metrics like uptime and resource...  ...autonomous vehicle, or consumer hardware space. ~ Deep technical... 
    Local area
    Remote work

    Sauron

    San Francisco, CA
    5 days ago
  •  ...About Flow Flow Engineering is an AI-native requirements platform...  ...engineering organizations, enabling hardware teams to collaborate with AI...  ...role Flow is hiring a Software Engineer with an...  ...salary and meaningful equity. Health, dental, and vision coverage.... 
    Flexible hours

    Flow Engineering

    San Francisco, CA
    3 days ago
  • $125k - $195k

     ...Atomic Semi is searching for a Robotics Software Engineer in San Francisco, California. The role requires building algorithm-rich software...  ...tools, demanding deep technical challenges in robotics and hardware integration. Successful candidates will have strong programming... 

    Atomic Semi

    San Francisco, CA
    1 day ago
  •  ...platform is vital to our mission. That's why we're seeking a software engineer to help us build out our trust and safety capabilities. In...  ...withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all... 

    OpenAI

    San Francisco, CA
    4 days ago
  • $140k - $170k

     ...quantify fish weights, detect the health status, and generate optimal...  ...at three levels: on-site hardware for image capture, cloud pipelines...  ...team. The role As a Platform Engineer, you will be responsible for...  ...and optimization Strong software engineering skills; knowledge... 
    Immediate start
    Remote work
    Flexible hours

    Aquabyte

    San Francisco, CA
    1 day ago
  • $342k

     ...with employer contributions to Health Savings Accounts Pre-tax...  ...conditions. About the Team OpenAI’s Hardware organization develops silicon...  ...while working closely with software and research partners to co-...  ...AI. About the Role As an Engineer on our hardware optimization... 
    Full time
    Work at office
    Local area
    Relocation package
    Flexible hours

    Centaur Labs

    San Francisco, CA
    4 days ago
  • $180k - $230k

     ...generative AI solutions for the healthcare revenue cycle, we help health systems comprehensively capture and communicate the full...  ...to help us accelerate that reality. About the Role As a Sr. Software Engineer, Backend, you will be building AI-powered solutions across our... 
    Home office
    Flexible hours

    AKASA

    San Francisco, CA
    7 hours ago
  • $130k - $155k

     ...pregnancy care data platform to improve maternal and child health outcomes. About You: As a backend engineer with Delfina, you will have the opportunity to...  ...debugging to ensure the delivery of high-quality software. Stay updated with the latest trends and advancements... 
    Full time
    Flexible hours

    Clutch Canada

    San Francisco, CA
    1 day ago
  • $108.7k - $181.1k

     ...accessible and affordable. Here, we focus on the health, happiness, and well-being of you and...  ...from you. Role Summary Ontada's Engineering team builds iKnowMed (iKM), the leading...  ...trial matching. We are hiring a Software Engineer III (P3) to design and build well... 
    Work experience placement
    Work at office
    Remote work
    2 days per week

    McKesson

    San Francisco, CA
    4 days ago
  • $150k - $170k

     ...like by solving these issues through our software platform (SaaS). We combine cutting edge...  ...is committed to improving the lives and health of complex patients that have an...  ...looking for a Senior Full Stack Software Engineer who is excited about leveraging AI to drive... 
    Live in
    Remote work

    Arine

    San Francisco, CA
    5 days ago
  •  ...Employment Type: Full-time Department: Engineering Reports to: Head of Engineering About Teal Health: Teal Health is on a mission to provide women...  ...This Role: We are seeking a senior full-stack software engineer to design, build, and maintain... 
    Full time
    Work at office
    Flexible hours

    Teal Health, Inc

    San Francisco, CA
    1 day ago
  • About Onos Health Onos Health’s mission is simple but ambitious: ensure every healthcare...  ...seeking an experienced Senior Backend Engineer who is motivated to meaningfully improve...  ...looking for: 5+ years experience building software applications Prior experience leading... 
    Work at office
    Remote work
    Home office
    Flexible hours
    2 days per week
    3 days per week

    Onos

    San Francisco, CA
    1 day ago
  • $115k - $150k

     ...Horowitz to Blackrock and Fidelity, and employs a team of 450 engineers and entrepreneurs. Astranis designs, builds, and...  ...00 sq. ft. headquarters in Northern California, USA. Hardware/Production Test Software Engineer We are seeking a highly skilled and motivated... 
    Permanent employment
    Flexible hours

    Astranis

    San Francisco, CA
    1 day ago
  • $266k

     ...Slope is seeking a System Software Engineer in San Francisco to design and validate low-level system software for AI hardware. The role involves managing firmware development, integrating partner software, and debugging across hardware interfaces. Ideal candidates have... 

    Slope

    San Francisco, CA
    7 hours ago
  • $266k

     ...with employer contributions to Health Savings Accounts Pre-tax...  ...conditions. About the Team OpenAI’s Hardware organization develops silicon...  ...while working closely with software and research partners to co-...  ...re seeking a System Software Engineer to join our First-Party... 
    Full time
    Work at office
    Local area
    Remote work
    Relocation package
    Flexible hours
    3 days per week

    Slope

    San Francisco, CA
    1 day ago
