Software Engineer, Fleet Hardware Health

OpenAI

About the team

The Fleet team at OpenAI supports the computing environment that powers our cutting-edge research and product development. We oversee large-scale systems that span data centers, GPUs, networking, and more, ensuring high availability, performance, and efficiency. Our work enables OpenAI’s models to operate seamlessly at scale, supporting both internal research and external products like ChatGPT. We prioritize safety, reliability, and responsible AI deployment over unchecked growth.

About the role

As a software engineer on the Fleet Hardware team, you will be responsible for the reliability and uptime of OpenAI’s entire compute fleet. Minimizing hardware failure is key to research training progress and stable services, as even a single hardware hiccup can cause significant disruptions. With increasingly large supercomputers, the stakes continue to rise.

Being at the forefront of technology means that we are often the pioneers in troubleshooting these state-of-the-art systems at scale. This is a unique opportunity to work with cutting-edge technologies and devise innovative solutions to maintain the health and efficiency of our supercomputing infrastructure.

Our team empowers strong engineers with a high degree of autonomy and ownership, as well as the ability to effect change. This role requires a keen focus on comprehensive system-level investigations and the development of automated solutions. We want people who go deep on problems, investigate as thoroughly as possible, and build automation for detection and remediation at scale.

In this role, you will:

  • Build and maintain automation systems for provisioning and managing server fleets.

  • Develop tools to monitor server health, performance, and lifecycle events.

  • Collaborate with clusters, networking, and infrastructure teams.

  • Partner with external operators to ensure a high level of quality.

  • Identify and fix performance bottlenecks and inefficiencies.

  • Continuously improve automation to reduce manual work.

You might thrive in this role if you have:

  • Experience managing large-scale server environments.

  • A balance of strengths in building and operationalizing.

  • Proficiency in Python, Go, or similar languages.

  • Strong Linux, networking, and server hardware knowledge.

  • Comfort digging into noisy data with SQL, PromQL, Pandas, or any other tool.

Prior hardware expertise is not required for this role.

Bonus Skills:

  • Experience with low-level details of hardware components, protocols, and associated Linux tooling (e.g., PCIe, InfiniBand, networking, power management, kernel perf tuning).

  • Knowledge of hardware management protocols (e.g., IPMI, Redfish).

  • High-performance computing (HPC) or distributed systems experience.

  • Prior experience developing, managing, or designing hardware.

  • Familiarity with monitoring tools (e.g., Prometheus, Grafana).

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or any other applicable legally protected characteristic.

For additional information, please see OpenAI’s Affirmative Action and Equal Employment Opportunity Policy Statement.

Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable law, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.

To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance.

We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.

Vacancy posted 4 days ago