Senior Systems Reliability Engineer, Observability at Scale
$184k - $356.5k
NVIDIA Gruppe
NVIDIA Gruppe is seeking a Senior Systems Software Engineer (SRE) in Santa Clara, California. This role focuses on designing and maintaining cloud systems with high efficiency and reliability. You will work on observability and telemetry collection platforms, ensuring optimal performance. The ideal candidate has more than 8 years of experience in automation and distributed systems, along with strong coding skills, particularly in Python or Go. Strong knowledge of Kubernetes and Linux is essential. The salary ranges from $184,000 to $356,500, depending on experience.
- NVIDIA Corporation is looking for a Senior Systems Software Engineer (SRE) in Santa Clara, California. This... ..., building, and maintaining large-scale production systems using various engineering... ...GPU cloud services run with maximum reliability, participating in service lifecycles...
Senior
$176k - $276k
Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline to design, build and maintain large scale production systems with high efficiency and availability using the combination... ...aspects of large scale Observability & Telemetry collection platform...
Senior
$184k - $356.5k
NVIDIA Corporation is seeking a Senior Systems Software Engineer based in Santa Clara, California. The ideal candidate will have deep experience in... ...and a related degree. Knowledge of Kubernetes and large-scale systems is essential. Competitive salary ranging from $18...
Senior
- NVIDIA Corporation, located in Santa Clara, CA, is seeking a Senior Systems Software Engineer focused on GPU Performance at Scale. This role entails leading performance practices in large-scale GPU infrastructure and aligning AI workloads with next-generation datacenter...
Senior
- NVIDIA Corporation is seeking a Senior Systems Software Engineer to join its advanced infrastructure software team in Santa Clara, California. You... ...designing, developing, and maintaining high-performance, rack-scale management solutions. The role emphasizes work in Rust,...
Senior
- Google Inc. in Sunnyvale, CA is looking for a Software Engineer to develop next-generation technologies crucial to... .... The ideal candidate will have experience with large-scale infrastructure and distributed systems, along with proficiency in programming languages such...
Senior
$272k - $431.25k
...NVIDIA, as a Principal Rack Scale Systems Infrastructure Engineer, you will build and... ...integration needs. Establish reliability, security, validation,... ...environments. Mentor senior engineers and technical... ..., updates, rollback, observability, health, and remediation...
Shift work
$207k - $300k
Google Inc. is looking for a Staff Software Engineer specializing in Site Reliability Engineering in Sunnyvale, CA. This role combines software and systems engineering to build and manage distributed systems, ensuring high reliability and uptime. The ideal candidate should...
Senior
$184k - $287.5k
...organization is seeking a Senior System Software Engineer to lead the evolution of... ...our next-generation Data & Observability Platform. We serve and... ...pipelines, and ensure platform reliability. What you’ll be doing:... ...of handling massive scale. You will solve global latency...
Senior
- Proofpoint is seeking a Senior Architect in Sunnyvale, California, to lead the design of enterprise-scale distributed systems supporting over 50 million connected sensors. The role requires heavy experience in backend architecture, scaling production systems, and establishing...
Senior, Flexible hours
- ...Staff - Inference to design and optimize large-scale AI inference systems. The role demands 5+ years in systems engineering and expertise in large-scale inference... ...closely with various teams to debug and drive the reliability of infrastructure. Competitive compensation and...
Senior, Flexible hours
$154k - $193k
...thermal batteries deliver reliable and cost-effective heat... ...-driven and passionate Senior or Staff Mechanical Engineer, Fluid Systems to join our Product... ...our team to deliver at scale You should be excited... ...flexible and inclusive holiday observance, as well as paid...
Senior, Flexible hours
- ...volume telemetry into reliable, job‑centric insights... ...our team of innovative engineers who are building this... ...Software Engineering and Systems Engineering team to... ...of reliability for an observability/AIOps platform: SLOs/SLIs... ...deploying, debugging, scaling) for telemetry‑heavy...
Senior
- A technology firm is seeking a Test Engineer to work with Google's test engineering team. Responsibilities include creating test plans... ...ideal candidate will have strong experience in testing large-scale systems and proficiency in Unix/Linux or Windows. Excellent...
Senior
$152k - $241.5k
...We’re looking for a Senior SRE to join our... ...critically important systems running while working... ...supporting large‑scale HPC clusters using... ...management, fleet reliability/auto‑healing, E2E observability or data‑driven operations... .... Mentored other engineers and influenced...
Senior
$145k - $165k
A technology solutions firm in Sunnyvale, CA is looking for a highly experienced Site Reliability Engineer (SRE). This role involves maintaining uptime and performance across systems. Exceptional Linux expertise and automation skills in Bash and Python are crucial. Key...
Senior
$200k - $322k
Senior Manager, Site Reliability Engineering... ...operations function at scale. This role goes beyond traditional... ...to build AI-powered systems that enhance reliability... ...operating model using observability, AI insights, and...
Senior
- Senior Staff Software Engineer, Site Reliability Engineering. In accordance with Washington state... ...troubleshooting distributed systems. Preferred... ...passion for monitoring and observability. Familiarity with the emerging... ...overall system health. Scale systems sustainably through...
Senior, Temporary work
$126k - $204.5k
...operating and maintaining a large‑scale GCP environment, including... ...of our comprehensive observability systems. To meet the opportunities... ...collaborate closely with our engineering teams to develop innovative... ...the product and ensure the reliability and availability of our...
Senior
- ...seeking an experienced Senior Architect to lead... ...of enterprise‑scale distributed systems supporting 50M+ connected... ...for scalability, reliability, security, and... ...optimization Data Platform Engineering Architect real‑time... ..., Reliability & Observability Establish and...
Senior, Flexible hours
- ...passionate about building world-class reliability systems? Join NVIDIA as a Senior Software Engineer - Resilience Engineering, DGX... ...experience in running large-scale systems and a deep... ...organization. Proficiency with modern observability and operational tools like Prometheus...
Senior
- Rhoda AI in Mountain View is seeking a Staff / Principal ML Training Systems Engineer to lead the performance of large-scale multimodal training systems. This role involves improving training efficiency and collaborating closely with research teams to accelerate model iteration...
Senior
- ...We are looking for a Senior Software Engineer to help build NeMo Platform, NVIDIA... ..., and operating AI systems at scale. This role will focus on NeMo... ...practical infrastructure for observing behavior, measuring... ...improvement techniques into reliable, reusable product capabilities...
Senior
$160k - $322k
NVIDIA Gruppe in Santa Clara is seeking a Senior Technical Marketing Engineer focused on GPUs and scale-up architecture. The role involves showcasing NVIDIA's GPU architecture and server-level platforms, aiming to maximize performance for AI applications. The ideal candidate...
Senior
$272k - $425.5k
Principal Software Engineer – Large-Scale LLM Memory and Storage Systems... ...accelerators and memory pools. Mentor senior and junior engineers, set technical direction for memory...
Local area, Remote work
$224k - $431.25k
NVIDIA Gruppe is seeking a Senior System Software Engineer for Cloud in Santa Clara, California. The role involves designing and building scalable cloud solutions for GeForce NOW. Candidates should have extensive experience with Java, Golang, and Kubernetes, along with...
Senior
- A leading technology company is seeking a Senior System Software Engineer for Cloud in Santa Clara, CA. This role involves designing and deploying scalable cloud-based solutions for a cloud gaming service. The ideal candidate will have extensive experience with programming...
Senior
$168k - $270.25k
...schema design, and expand observability over the factory... ...develop scalable and reliable factory components. Work... ...distributed and compute systems, backend services,... ...Computer Science, Computer Engineering or related field (or... ...experience working with large‑scale full‑stack development...
Senior
- Ll Oefentherapie is seeking a Senior Principal Software Developer in Santa Clara, California. This role entails leading the design and operation of high-scale distributed systems while mentoring engineers within the team. Applicants should have over 10 years of software...
Senior
- Senior Systems Software Engineer - GPU Performance at Scale. We are looking for a dedicated engineer for the Senior Systems Software Engineer role, focusing on GPU Performance at Scale. The position will drive innovation in AI and GPU computing. What You’ll Be Doing: Lead...
Senior