Senior Site Reliability Engineer - HPC
$152k - $241.5k
NVIDIA
NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology and amazing people. NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars.

NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence. We’re looking for a Senior SRE to join our Compute Farm team and help build the next generation of our global services platform. At NVIDIA, you’ll keep critically important systems running while working on the technologies that are redefining computing. You’ll harness the power of AI to deliver groundbreaking solutions to some of the world’s toughest problems, and see your work have real, lasting impact!

What you'll be doing:
- Own SRE solutions end-to-end, from design and implementation to operation and continuous improvement, ensuring they integrate cleanly with HPC schedulers, storage, and network fabrics.
- Use Infrastructure-as-Code (IaC) and config management to standardize and automate provisioning everywhere.
- Deliver solutions in a globally distributed, multi-cloud hybrid environment: on-prem, AWS, GCP, and OCI.
- Design for failure with redundancy, failure domains, progressive delivery, and strict change control.
- Ensure the highest level of uptime and Quality of Service (QoS) for internal customers through operational excellence.
- Conduct capacity management and planning to meet ongoing operational needs.
- Detect performance issues and recommend solutions to maintain world-class service quality.
- Collaborate with various teams in a fast-paced environment to ensure seamless project completion.
- Participate in on-call and incident reviews, assist in root cause identification, and produce high-quality RCA reports.

What we need to see:
- B.S. degree in Computer Science or related technical field (or equivalent experience) with 5+ years of professional experience building and supporting critical services.
- Experience supporting large-scale HPC clusters using Slurm, LSF or Kubernetes clusters, including setup, tuning, and troubleshooting.
- Proficiency in modern CI/CD techniques and Infrastructure as Code (IaC) for managing services.
- Strong experience crafting large-scale infrastructure platforms for automated host lifecycle management, fleet reliability/auto-healing, E2E observability or data-driven operations (AIOps/ML-driven signals) that materially reduce manual intervention.
- Proficiency in monitoring, metrics, container management, and log collection tools.
- 5+ years of coding/scripting experience in at least two high-level programming languages such as Python, Go, Perl, or Ruby.
- Experience mentoring other engineers and influencing technical direction through design reviews, architecture documents, and strong partnership with product and leadership.
- Creative problem solver with excellent debugging skills and strong communication and documentation abilities.

Ways to stand out from the crowd:
- Published technical write-ups or talks (conference presentations, meetups, engineering blogs) that deep-dive into real-world reliability, observability, or large-scale HPC/SRE problems and their solutions.
- Maintainer or co-maintainer responsibilities for an open source component used in production (plugins, operators, exporters, controllers, or SDKs) at large scale.

Widely considered to be one of the technology world’s most desirable employers, NVIDIA offers highly competitive salaries and a comprehensive benefits package. We have some of the most brilliant and talented people in the world working for us and, due to unprecedented growth, our world-class engineering teams are growing fast. If you're a creative and autonomous engineer with real passion for technology, we want to hear from you.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD - 241,500 USD for Level 3, and 184,000 USD - 287,500 USD for Level 4. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 19, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

NVIDIA pioneered accelerated computing. Today, our AI infrastructure powers global intelligence, transforming every industry.


