Site Reliability Engineer - Hardware Infrastructure

NVIDIA Gruppe

At NVIDIA, Site Reliability Engineering provides a rare chance to define, develop, and support large-scale production systems with high efficiency and availability. This demanding position merges software and systems engineering efforts to guarantee flawless service operation with consistent reliability and uptime. As an SRE here, you will be part of a welcoming team that values collaboration and creativity, empowering developers to make significant updates while sustaining efficient system function.

What you'll be doing:
  • Develop and support guidelines for incident management, planned maintenance, and blameless postmortems.
  • Assist teams in responding to high severity incidents, driving root cause analysis, crafting high-quality postmortems, and developing post-incident corrective actions.
  • Define reliability and supportability metrics, Service Level Objectives, and error budgets.
  • Develop and drive the adoption of actionable, customer-centric monitoring and alerting.
  • Apply automation and Generative AI/Agentic solutions to minimize manual and tedious activities and boost customer support.
  • Guide teams on establishing sustainable on-call and operational standards.

What we need to see:
  • Degree in Computer Science or a related technical field involving coding, or equivalent experience.
  • 8+ years of experience in SRE, DevOps, or Production Engineering.
  • Strong understanding of SRE principles, including incident management, error budgets, SLOs, and SLAs.
  • Experience crafting and deploying systems that are fault-tolerant, performant, and supportable.
  • Background with infrastructure automation.
  • Experience running critical services in production.
  • Experience in one or more of the following: Python, Go, Perl, or Ruby.
  • Hands-on experience with observability platforms (e.g., Prometheus, Grafana).
  • Strong communication skills with the ability to convey technical concepts effectively to diverse audiences.
  • Flexibility and adaptability working in a fast-paced environment with evolving requirements.

Ways to stand out from the crowd:
  • Expertise in establishing incident management and postmortem processes.
  • Experience driving adoption of common tools and processes across diverse groups.
  • Experience working with LLM/Generative AI/Agentic solutions to shorten mitigation time, lessen toil, and ensure Service Level Objectives are met.
  • Hands-on expertise operating and scaling distributed systems with tight SLAs, ensuring high availability and performance.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5. You will also be eligible for equity and benefits.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Vacancy posted 5 days ago
Similar jobs that could be interesting for you
Based on the Site Reliability Engineer - Hardware Infrastructure in Santa Clara, CA vacancy
  • A prominent tech company in Sunnyvale is seeking a Senior Signal Integrity Engineer to work on cutting-edge data center hardware. The role involves engaging with multiple teams to ensure signal integrity across various systems. Ideal candidates should have a Bachelor's... 
    Suggested

    Google Inc.

    Sunnyvale, CA
    2 days ago
  •  ...are seeking a highly skilled and motivated Connectivity Hardware Design Release Engineer (DRE) to join our team, focused on developing best-in-class...  ...manager, and system engineering, to create secure and reliable connectivity solutions and assist in defining future technology... 
    Suggested
    Contract work
    Local area
    Work from home
    Relocation package

    General Motors

    Sunnyvale, CA
    3 days ago
  • $184k - $287.5k

     ...push the boundaries of innovation and engineering? At NVIDIA, we lead the world in accelerated...  ...high‑performance systems. As a Senior Hardware Systems Engineer, you will help build...  ...with hyperscale data center infrastructure, including cooling methods, facility power... 
    Suggested

    NVIDIA Gruppe

    Santa Clara, CA
    5 days ago
  • $70 - $75 per hour

     ...Job Description We are seeking a modeling Engineer to develop high-level models of complex SoC hardware. The virtual platforms combine models of custom hardware accelerators for vision, 2D and 3D graphics, machine learning and more, within a multi... 
    Suggested
    Night shift

    Phizenix

    Sunnyvale, CA
    18 days ago
  • A leading technology firm in Sunnyvale, CA, seeks a Product Quality Engineer for GPU platforms. This role involves leading quality initiatives, ensuring the reliability of hardware systems, and collaborating with manufacturing partners. Candidates should have a Bachelor... 
    Suggested

    Google Inc.

    Sunnyvale, CA
    5 days ago
  • Product Quality Engineer, GPU Platforms, Hardware Quality and Reliability Google Sunnyvale, CA, USA Qualifications Bachelor's degree in Mechanical Engineering...  .../debug methods). About the job Google’s Cloud infrastructure is one of the world's most advanced compute,... 
    Contract work

    Google Inc.

    Sunnyvale, CA
    5 days ago
  • $136k - $218.5k

     ...dedicated and motivated Software developer with particular interest in algorithms and RTL Design. Understanding both Software and Hardware principles will be a key requirement for this role. What you'll be doing: Architect, design, develop and support tools for RTL... 
    Work experience placement

    NVIDIA

    Santa Clara, CA
    1 day ago
  • $145k - $165k

     ...Graphics is seeking a highly experienced Site Reliability Engineer (SRE) to design, build, and operate...  ...highly available, fault-tolerant infrastructure and services. Install, maintain, and...  ...server, storage, and networking hardware in office and colocation facilities.... 
    Work at office

    Bolt Graphics

    Sunnyvale, CA
    3 days ago
  • $131k - $175k

     ...Senior Hardware Systems Engineer – AI Rack & Cluster Infrastructure Arista Networks is an industry leader in data-driven, client-to-cloud networking for large data center, campus and routing environments. What sets us apart is our relentless pursuit of innovation.... 
    Remote work
    Flexible hours

    Arista Networks, Inc.

    Santa Clara, CA
    1 day ago
  • $113.9k - $200.91k

     ...space and find a career that's built for you. Lockheed Martin Space is seeking a full-time Hardware/Software Integrator. In this role you will support the software-engineering lifecycle defined in the program Software Development Plan (SDP) by establishing,... 
    Full time
    Temporary work
    Work experience placement
    Interim role
    Work at office
    Remote work
    Relocation
    Flexible hours
    Shift work

    Lockheed Martin Corporation

    Sunnyvale, CA
    11 days ago
  • $147k - $216k

     ...Architect for its Cloud Supply Chain and Operations team. This role will involve shaping design requirements and driving execution on hardware projects. The ideal candidate will have a Bachelor's degree in relevant fields and experience in data centers or machine learning... 

    Google Inc.

    Sunnyvale, CA
    4 days ago
  • $132k - $189k

     ...specialized role which requires physical interaction with hardware equipment in a simulated data center environment,...  ...equipment. Regular development and processing of engineering hardware must be performed on site. Bachelor’s degree in Electrical Engineering, Computer... 
    Full time

    Google Inc.

    Sunnyvale, CA
    1 day ago
  • $207k - $300k

    Site Reliability Engineering Manager, Google Distributed Cloud Google Sunnyvale, CA, USA Bachelor’s...  ...managing distributed systems or cloud infrastructure, with a focus on Kubernetes. 3...  ...an offering designed as a converged hardware and software solution for customers... 
    Full time

    Google Inc.

    Sunnyvale, CA
    2 days ago
  • $120k - $172k

    A leading technology company in California seeks a Product Quality Engineer for hardware within Google Cloud. This role involves owning the product quality process, utilizing advanced statistical methods, and collaborating with cross-functional teams to ensure exceptional... 

    Google Inc.

    Sunnyvale, CA
    5 days ago
  • General Motors is seeking a Connectivity Hardware Design Release Engineer in Sunnyvale, California, to develop best-in-class automotive connectivity systems. Responsibilities include sourcing, change management, and integration of connectivity telematics modules, alongside... 

    General Motors

    Sunnyvale, CA
    5 days ago
  • $159k - $231k

    Senior Hardware Validation Engineer, Platforms Bachelor’s degree in Electrical Engineering, Computer...  ...Lab team manages this critical infrastructure to ensure product teams can focus on...  ...at unparalleled scale, efficiency, reliability and velocity. Our customers include... 
    Full time
    Work at office
    Worldwide

    Google Inc.

    Sunnyvale, CA
    4 days ago
  • $147k - $216k

    Senior Hardware Validation Engineer, Google Cloud Platform Google Sunnyvale, CA,...  ...hardware must be performed on site. Bachelor’s degree in...  ...development teams. The AI and Infrastructure team is redefining what’s...  ...scale, efficiency, reliability and velocity. Our customers... 
    Full time
    Worldwide

    Google Inc.

    Sunnyvale, CA
    2 days ago
  • $170.2k

    A global healthcare company is seeking a Hardware Reliability Test Engineer in Santa Clara, California. You will translate reliability requirements into test protocols, execute various reliability tests, and design custom test fixtures. The ideal candidate will have a... 

    Johnson & Johnson

    Santa Clara, CA
    4 days ago
  • $136k - $218.5k

     ...NVIDIA is seeking capable customer-facing hardware engineers to work directly with Cloud Scale Providers (CSP’s) deploying next generation...  ...as AI Factories, are vital to scale compute and networking infrastructure needed for agentic AI processing. The CSP HW Systems... 

    NVIDIA

    Santa Clara, CA
    1 day ago
  • $175k - $263k

     ...ready to seize the endless opportunities and leave your mark, come join us. THE ROLE We are seeking a highly motivated Hardware Design Engineer for Pure’s Datastore team. The Pure Engineering Team builds the industry’s most innovative, high-performance, energy-... 
    Full time
    Work at office
    Flexible hours

    Everpure

    Santa Clara, CA
    3 days ago
  • $147.4k - $272.1k

    Site Reliability Engineer (Edge Services), Infrastructure Services Sunnyvale, California, United States Software and Services We are seeking a proactive Site Reliability Engineer to champion the evolution of our production ecosystems. In this role, you will help drive... 
    Relocation
    Shift work

    Apple Inc.

    Sunnyvale, CA
    5 days ago
  • $164.8k - $226.6k

     ...higher performance, smaller size, lower power, and better reliability. With more than 4 billion devices shipped, SiTime is...  ...: Job Summary We are seeking a hands-on Principal Infrastructure Hardware Engineer to architect, design, and deliver system platforms supporting... 

    SiTime Corporation

    Santa Clara, CA
    a month ago
  • $120k - $172k

    Product Quality Engineer, Hardware, Google Cloud Google Sunnyvale, CA, USA Bachelor's degree in Mechanical Engineering...  ...discipline, or equivalent practical experience. Certified Reliability/Quality Engineer (CRE/CQE) certification or equivalent... 
    Full time
    Worldwide

    Google Inc.

    Sunnyvale, CA
    5 days ago
  • $140k - $300k

     ...critical role in supporting Tesla's AI hardware initiatives by developing automation, infrastructure, and services. Join a dynamic team of engineers dedicated to accelerating workloads...  ...observability, and reporting to ensure system reliability and performance ~... 
    Hourly pay
    Full time
    Temporary work
    Flexible hours

    Tesla

    Palo Alto, CA
    1 day ago
  • $176k - $276k

    Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline to design, build and maintain large scale production systems with high...  ...), or equivalent experience 8+ years of experience with Infrastructure automation, distributed systems design, experience with... 

    NVIDIA Corporation

    Santa Clara, CA
    3 days ago
  • $120.3k - $194.53k

     ...kind of precision that drives great outcomes. Job Summary Palo Alto Networks runs a large hybrid infrastructure across multiple public clouds. As a Site Reliability Engineer on the Internet Security Platform team, you will be part of a team supporting Advanced DNS... 
    Full time
    Work at office
    Visa sponsorship
    Work visa

    Palo Alto Networks, Inc.

    Santa Clara, CA
    3 days ago
  • $125k - $150k

    A leading data storage firm is seeking a Platform Engineer IV to serve as a subject matter expert in hardware evaluation for enterprise storage appliances. The ideal candidate will have 5-7+ years of experience, particularly with Linux/ZFS systems. Responsibilities include... 
    Work at office

    iXsystems DBA TrueNAS, Inc (“TrueNAS”)

    Campbell, CA
    2 days ago
  • $190k - $220k

     ...revolution in AI data center infrastructure, enabling the next giant...  ...first 3D-stacked photonics engine, Passage™, capable of connecting...  ...coordination and management. Hardware Monitoring & Management:...  ...automated tests to monitor the reliability and performance of the... 
    Full time
    Temporary work
    Flexible hours

    Lightmatter

    Mountain View, CA
    2 days ago
  • $120k - $172k

    Manufacturing Test Development Engineer, Data Center Google Sunnyvale, CA, USA...  ...manufacturing test solutions for data center hardware, in particular networking and optics...  .... When vendors build parts for our infrastructure, you are right there alongside ensuring... 
    Full time
    Contract work
    Worldwide

    Google Inc.

    Sunnyvale, CA
    5 days ago
  • $128.4k - $226.44k

     ...resiliency. This position will be a key member of the Systems Engineering, Integration and Test (SEIT) team in support of the final design...  ...for technical and program documentation - Support systems/hardware/software integration, test planning and execution and find opportunities... 
    Full time
    Temporary work
    Work experience placement
    Work at office
    Remote work
    Relocation
    Flexible hours
    Shift work

    Lockheed Martin Corporation

    Sunnyvale, CA
    3 days ago
