Senior Systems Reliability Engineer, Observability at Scale

$184k - $356.5k

NVIDIA Gruppe

NVIDIA Gruppe is seeking a Senior Systems Software Engineer (SRE) in Santa Clara, California. This role focuses on designing and maintaining cloud systems with high efficiency and reliability. You will work on observability and telemetry collection platforms and keep them performing well. The ideal candidate has more than 8 years of experience in automation and distributed systems and strong coding skills, particularly in Python or Go. Strong knowledge of Kubernetes and Linux is essential. The salary ranges from $184,000 to $356,500, depending on experience.

Vacancy posted 1 day ago
Similar jobs that could be interesting for you
Based on the Senior Systems Reliability Engineer, Observability at Scale vacancy in Santa Clara, CA
  • NVIDIA Corporation is looking for a Senior Systems Software Engineer (SRE) in Santa Clara, California. This...  ..., building, and maintaining large-scale production systems using various engineering...  ...GPU cloud services run with maximum reliability, participating in service lifecycles... 
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    2 days ago
  • $176k - $276k

    Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline to design, build and maintain large scale production systems with high efficiency and availability using the combination...  ...aspects of large scale Observability & Telemetry collection platform... 
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    2 days ago
  • $184k - $356.5k

    NVIDIA Corporation is seeking a Senior Systems Software Engineer based in Santa Clara, California. The ideal candidate will have deep experience in...  ...and a related degree. Knowledge of Kubernetes and large-scale systems is essential. Competitive salary ranging from $18... 
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    2 days ago
  • NVIDIA Corporation, located in Santa Clara, CA, is seeking a Senior Systems Software Engineer focused on GPU Performance at Scale. This role entails leading performance practices in large-scale GPU infrastructure and aligning AI workloads with next-generation datacenter... 
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    4 days ago
  • NVIDIA Corporation is seeking a Senior Systems Software Engineer to join its advanced infrastructure software team in Santa Clara, California. You...  ...designing, developing, and maintaining high-performance, rack-scale management solutions. The role emphasizes work in Rust,... 
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    4 days ago
  • Google Inc. in Sunnyvale, CA is looking for a Software Engineer to develop next-generation technologies crucial to...  .... The ideal candidate will have experience with large-scale infrastructure and distributed systems, along with proficiency in programming languages such... 
    Senior

    Google Inc.

    Sunnyvale, CA
    2 days ago
  • $272k - $431.25k

     ...NVIDIA, as a Principal Rack Scale Systems Infrastructure Engineer, you will build and...  ...integration needs. Establish reliability, security, validation,...  ...environments. Mentor senior engineers and technical...  ..., updates, rollback, observability, health, and remediation... 
    Shift work

    Jobleads-US

    Santa Clara, CA
    2 days ago
  • $207k - $300k

    Google Inc. is looking for a Staff Software Engineer specializing in Site Reliability Engineering in Sunnyvale, CA. This role combines software and systems engineering to build and manage distributed systems, ensuring high reliability and uptime. The ideal candidate should... 
    Senior

    Google Inc.

    Sunnyvale, CA
    3 days ago
  • $184k - $287.5k

     ...organization is seeking a Senior System Software Engineer to lead the evolution of...  ...our next-generation Data & Observability Platform. We serve and...  ...pipelines, and ensure platform reliability. What you’ll be doing:...  ...of handling massive scale. You will solve global latency... 
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  • Proofpoint is seeking a Senior Architect in Sunnyvale, California, to lead the design of enterprise-scale distributed systems supporting over 50 million connected sensors. The role requires heavy experience in backend architecture, scaling production systems, and establishing... 
    Senior
    Flexible hours

    Proofpoint

    Sunnyvale, CA
    4 days ago
  •  ...Staff — Inference to design and optimize large-scale AI inference systems. The role demands 5+ years in systems engineering and expertise in large-scale inference...  ...closely with various teams to debug and drive the reliability of infrastructure. Competitive compensation and... 
    Senior
    Flexible hours

    RadixArk

    Palo Alto, CA
    2 days ago
  • $154k - $193k

     ...thermal batteries deliver reliable and cost-effective heat...  ...-driven and passionate Senior or Staff Mechanical Engineer, Fluid Systems to join our Product...  ...our team to deliver at scale You should be excited...  ...flexible and inclusive holiday observance, as well as paid... 
    Senior
    Flexible hours

    Antora Energy

    San Jose, CA
    8 days ago
  •  ...volume telemetry into reliable, job‑centric insights...  ...our team of innovative engineers who are building this...  ...Software Engineering and Systems Engineering team to...  ...of reliability for an observability/AIOps platform: SLOs/SLIs...  ...deploying, debugging, scaling) for telemetry‑heavy... 
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    1 day ago
  • A technology firm is seeking a Test Engineer to work with Google's test engineering team. Responsibilities include creating test plans...  ...ideal candidate will have strong experience in testing large-scale systems and proficiency in Unix/Linux or Windows. Excellent... 
    Senior

    TechDigital Group

    Mountain View, CA
    2 days ago
  • $152k - $241.5k

     ...We’re looking for a Senior SRE to join our...  ...critically important systems running while working...  ...supporting large‑scale HPC clusters using...  ...management, fleet reliability/auto‑healing, E2E observability or data‑driven operations...  .... Mentored other engineers and influenced... 
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  • $145k - $165k

    A technology solutions firm in Sunnyvale, CA is looking for a highly experienced Site Reliability Engineer (SRE). This role involves maintaining uptime and performance across systems. Exceptional Linux expertise and automation skills in Bash and Python are crucial. Key... 
    Senior

    Bolt Graphics, Inc.

    Sunnyvale, CA
    3 days ago
  • $200k - $322k

    Senior Manager, Site Reliability Engineering...  ...operations function at scale. This role goes beyond traditional...  ...to build AI-powered systems that enhance reliability...  ...operating model using observability, AI insights, and... 
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    3 days ago
  • Senior Staff Software Engineer, Site Reliability Engineering In accordance with Washington state...  ...troubleshooting distributed systems. Preferred...  ...passion for monitoring and observability. Familiarity with the emerging...  ...overall system health. Scale systems sustainably through... 
    Senior
    Temporary work

    Google Inc.

    Sunnyvale, CA
    14 hours ago
  • $126k - $204.5k

     ...operating and maintaining a large‑scale GCP environment, including...  ...of our comprehensive observability systems. To meet the opportunities...  ...collaborate closely with our engineering teams to develop innovative...  ...the product and ensure the reliability and availability of our... 
    Senior

    Palo Alto Networks, Inc.

    Santa Clara, CA
    4 days ago
  •  ...seeking an experienced Senior Architect to lead...  ...of enterprise‑scale distributed systems supporting 50M+ connected...  ...for scalability, reliability, security, and...  ...optimization Data Platform Engineering Architect real‑time...  ..., Reliability & Observability Establish and... 
    Senior
    Flexible hours

    Proofpoint

    Sunnyvale, CA
    14 hours ago
  •  ...passionate about building world-class reliability systems? Join NVIDIA as a Senior Software Engineer - Resilience Engineering, DGX...  ...experience in running large-scale systems and a deep...  ...organization. Proficiency with modern observability and operational tools like Prometheus... 
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    3 days ago
  • Rhoda AI in Mountain View is seeking a Staff / Principal ML Training Systems Engineer to lead the performance of large-scale multimodal training systems. This role involves improving training efficiency and collaborating closely with research teams to accelerate model iteration... 
    Senior

    Rhoda AI

    Mountain View, CA
    2 days ago
  •  ...We are looking for a Senior Software Engineer to help build NeMo Platform, NVIDIA...  ..., and operating AI systems at scale. This role will focus on NeMo...  ...practical infrastructure for observing behavior, measuring...  ...improvement techniques into reliable, reusable product capabilities... 
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    14 hours ago
  • $160k - $322k

    NVIDIA Gruppe in Santa Clara is seeking a Senior Technical Marketing Engineer focused on GPUs and scale-up architecture. The role involves showcasing NVIDIA's GPU architecture and server-level platforms, aiming to maximize performance for AI applications. The ideal candidate... 
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  • $272k - $425.5k

    Principal Software Engineer – Large-Scale LLM Memory and Storage Systems...  ...accelerators and memory pools. Mentor senior and junior engineers, set technical direction for memory... 
    Local area
    Remote work

    NVIDIA Corporation

    Santa Clara, CA
    2 days ago
  • $224k - $431.25k

    NVIDIA Gruppe is seeking a Senior System Software Engineer for Cloud in Santa Clara, California. The role involves designing and building scalable cloud solutions for GeForce NOW. Candidates should have extensive experience with Java, Golang, and Kubernetes, along with... 
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  • A leading technology company is seeking a Senior System Software Engineer for Cloud in Santa Clara, CA. This role involves designing and deploying scalable cloud-based solutions for a cloud gaming service. The ideal candidate will have extensive experience with programming... 
    Senior

    NVIDIA

    Santa Clara, CA
    2 days ago
  • $168k - $270.25k

     ...schema design, and expand observability over the factory...  ...develop scalable and reliable factory components. Work...  ...distributed and compute systems, backend services,...  ...Computer Science, Computer Engineering or related field (or...  ...experience working with large‑scale full‑stack development... 
    Senior

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  • Ll Oefentherapie is seeking a Senior Principal Software Developer in Santa Clara, California. This role entails leading the design and operation of high-scale distributed systems while mentoring engineers within the team. Applicants should have over 10 years of software... 
    Senior

    Ll Oefentherapie

    Santa Clara, CA
    1 day ago
  • Senior Systems Software Engineer - GPU Performance at Scale. We are looking for a dedicated engineer for the Senior Systems Software Engineer role, focusing on GPU Performance at Scale. The position will drive innovation in AI and GPU computing. What you'll be doing: Lead... 
    Senior

    NVIDIA Corporation

    Santa Clara, CA
    3 days ago
