Software Engineer, Fleet Hardware Health
OpenAI
About the team
The Fleet team at OpenAI supports the computing environment that powers our cutting-edge research and product development. We oversee large-scale systems that span data centers, GPUs, networking, and more, ensuring high availability, performance, and efficiency. Our work enables OpenAI’s models to operate seamlessly at scale, supporting both internal research and external products like ChatGPT. We prioritize safety, reliability, and responsible AI deployment over unchecked growth.
About the role
As a software engineer on the Fleet Hardware team, you will be responsible for the reliability and uptime of OpenAI’s entire compute fleet. Minimizing hardware failures is key to research training progress and stable services, as even a single hardware hiccup can cause significant disruptions. With increasingly large supercomputers, the stakes continue to rise.
Being at the forefront of technology means that we are often the pioneers in troubleshooting these state-of-the-art systems at scale. This is a unique opportunity to work with cutting-edge technologies and devise innovative solutions to maintain the health and efficiency of our supercomputing infrastructure.
Our team empowers strong engineers with a high degree of autonomy and ownership, as well as the ability to effect change. This role requires a keen focus on comprehensive system-level investigations and the development of automated solutions. We want people who go deep on problems, investigate as thoroughly as possible, and build automation for detection and remediation at scale.
In this role, you will:
Build and maintain automation systems for provisioning and managing server fleets.
Develop tools to monitor server health, performance, and lifecycle events.
Collaborate with clusters, networking, and infrastructure teams.
Partner with external operators to ensure a high level of quality.
Identify and fix performance bottlenecks and inefficiencies.
Continuously improve automation to reduce manual work.
You might thrive in this role if you have:
Experience managing large-scale server environments.
A balance of strengths in building systems and operating them in production.
Proficiency in Python, Go, or similar languages.
Strong Linux, networking, and server hardware knowledge.
Comfort digging into noisy data with SQL, PromQL, pandas, or any other tool.
Prior hardware expertise is not required for this role.
Bonus Skills:
Experience with low-level details of hardware components, protocols, and associated Linux tooling (e.g., PCIe, InfiniBand, networking, power management, kernel performance tuning).
Knowledge of hardware management protocols (e.g., IPMI, Redfish).
High-performance computing (HPC) or distributed systems experience.
Prior experience developing, managing, or designing hardware.
Familiarity with monitoring tools (e.g., Prometheus, Grafana).
About OpenAI
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.
We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability or any other legally protected status.
For US-based candidates: Pursuant to the San Francisco Fair Chance Ordinance, we will consider qualified applicants with arrest and conviction records.
We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.
At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.


