
Reliability/DFX Engineer

OpenAI

About the Team

OpenAI’s Hardware organization develops silicon and system-level solutions designed for the unique demands of advanced AI workloads. The team is responsible for building the next generation of AI-native silicon while working closely with software and research partners to co-design hardware tightly integrated with AI models. In addition to delivering production-grade silicon for OpenAI’s supercomputing infrastructure, the team also creates custom design tools and methodologies that accelerate innovation and enable hardware optimized specifically for AI.

About the Role

We are seeking a highly skilled cross-stack engineer with deep expertise in making ML systems reliable at scale. This hands-on individual contributor will sit within our hardware team and work closely with chip design, platform design, hardware health, and the broader industry ecosystem to architect, implement, and deploy reliable next-generation AI accelerator systems. This engineer will evaluate system and chip architecture holistically, identify high-ROI opportunities to improve reliability and availability across the stack, and translate those opportunities into strategy and silicon features.

In this role, you will:
  • Oversee DFX architecture, implementation, and execution in silicon from concept to high-volume deployment, and propose high-ROI features to enhance reliability and fault tolerance. DFX includes design for testability, reliability, availability, and serviceability of high-performance AI hardware.
  • Build system-level reliability models grounded in empirical data to guide organization-wide DFX and reliability strategy. This requires a detailed understanding of chip and system architecture, design, implementation, and component-level reliability.
  • Collaborate with chip and platform architecture/design teams to explore and implement DFX features, including the specification and implementation of digital/mixed-signal IP, firmware/system software, and DFX methodology (in partnership with engineering teams).
  • Partner with hardware health and platform design teams to continuously improve reliability and fault tolerance in NPI and HVM. This includes optimizing operating conditions, designing experiments, and performing data analysis to drive continuous, data-driven improvements across the stack.
  • Serve as the DFX/reliability champion and evangelist to align the broader industry ecosystem with OpenAI’s requirements and roadmap.

Qualifications
  • BS with 15+ years, MS with 10+ years, or PhD with 3+ years of relevant industry experience focused on reliability across the chip/platform stack.
  • Hands-on experience with RTL design and DFT is required; physical implementation and/or silicon ATE experience is preferred.
  • Detailed understanding of ML chip and platform architecture and ML workload characteristics is required.
  • Strong fundamentals in reliability modeling, with hands-on skills in empirical data analysis.

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.

For additional information, please see OpenAI’s Affirmative Action and Equal Employment Opportunity Policy Statement.

Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable law, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.

To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance.

We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

OpenAI Global Applicant Privacy Policy

At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.

Vacancy posted 1 day ago
San Francisco, CA
