Site Reliability Engineer - Hardware Infrastructure
NVIDIA Gruppe
At NVIDIA, Site Reliability Engineering provides a rare chance to define, develop, and support large-scale production systems with high efficiency and availability. This demanding position merges software and systems engineering efforts to guarantee flawless service operation with consistent reliability and uptime. As an SRE here, you will be part of a welcoming team that values collaboration and creativity, empowering developers to make significant updates while sustaining efficient system function.

What you'll be doing:
- Develop and support guidelines for incident management, planned maintenance, and blameless postmortems.
- Assist teams in responding to high severity incidents, driving root cause analysis, crafting high-quality postmortems, and developing post-incident corrective actions.
- Define reliability and supportability metrics, Service Level Objectives, and error budgets.
- Develop and drive the adoption of actionable, customer‑centric monitoring and alerting.
- Apply automation and Generative AI/Agentic solutions to minimize manual and tedious activities and boost customer support.
- Guide teams on establishing sustainable on‑call and operational standards.

What we need to see:
- Degree in Computer Science or a related technical field involving coding, or equivalent experience.
- 8+ years of experience in SRE, DevOps, or Production Engineering.
- Strong understanding of SRE principles, including incident management, error budgets, SLOs, and SLAs.
- Experience crafting and deploying systems that are fault‑tolerant, performant, and supportable.
- Background with infrastructure automation.
- Experience running critical services in production.
- Experience in one or more of the following: Python, Go, Perl, or Ruby.
- Hands‑on experience with observability platforms (e.g., Prometheus, Grafana).
- Strong communication skills with the ability to convey technical concepts effectively to diverse audiences.
- Flexibility and adaptability working in a fast‑paced environment with evolving requirements.

Ways to stand out from the crowd:
- Expertise in establishing incident management and postmortem processes.
- Experience driving adoption of common tools and processes across diverse groups.
- Experience working with LLM/Generative AI/Agentic solutions to shorten mitigation time, lessen toil, and ensure Service Level Objectives are met.
- Hands‑on expertise operating and scaling distributed systems with tight SLAs, ensuring high availability and performance.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5. You will also be eligible for equity and benefits.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.


