Senior AI and HPC Observability Engineer

$152k - $241.5k

NVIDIA

NVIDIA is a pioneer in accelerated computing, known for inventing the GPU and driving breakthroughs in gaming, computer graphics, high-performance computing, and artificial intelligence. Our technology powers everything from generative AI to autonomous systems, and we continue to shape the future of computing through innovation and collaboration. Within this mission, our team, Managed AI Superclusters (MARS), builds and scales the infrastructure, platforms, and tools that enable researchers and engineers to develop the next generation of AI/ML systems. By joining us, you’ll help design solutions that power some of the world’s most advanced computing workloads.

Observability is at the heart of this transformation. We are looking for a strong AI & HPC Observability Engineer to build and scale next-generation Observability and Telemetry platforms. You will design and develop high-throughput, reliable telemetry pipelines and modern data infrastructure. This role requires solid distributed systems fundamentals, production-grade coding, and a passion for operational excellence.

What You Will Be Doing:

  • Design and scale observability platforms handling high-volume metrics, logs, and traces across distributed environments

  • Build high-performance backend services for telemetry ingestion, processing, and routing

  • Develop and extend OpenTelemetry collectors, processors, exporters, and instrumentation libraries

  • Build and optimize metrics pipelines using large-scale time-series storage systems

  • Design and operate real-time and batch telemetry pipelines using streaming and distributed data technologies

  • Improve platform reliability, performance, and cost efficiency through tuning, capacity planning, and system optimization

  • Develop monitoring, alerting, and service reliability frameworks to ensure platform health and performance

  • Collaborate with platform engineering, infrastructure, and site reliability teams to deliver production-grade observability solutions

What We Need to See:

  • Bachelor’s degree in Computer Science, Computer Engineering, or a related field, or equivalent experience

  • 5+ years of experience building backend or distributed systems in production environments

  • Strong programming skills in Python, Go, or Java, with experience developing production-quality software

  • Hands-on experience with modern observability architectures, including metrics, logs, and traces

  • Solid experience with PromQL and time-series data systems

  • Experience building or operating distributed data pipelines using technologies such as Kafka, Spark, or Flink

  • Experience working with Kubernetes and cloud-native infrastructure

  • Strong understanding of distributed systems, concurrency, and fault-tolerant system design

  • Strong debugging, performance tuning, and production operations skills

Ways to Stand Out from the Crowd:

  • Proven experience designing and scaling observability platforms for AI, GPU, or HPC environments

  • Hands-on expertise with OpenTelemetry, Prometheus, Kafka, and high-volume distributed telemetry pipelines

  • Strong background in data engineering, time-series data modeling, and real-time performance tuning

  • Experience integrating observability with AI/ML pipelines, GPU workload monitoring, or intelligent alerting

  • Demonstrated use of statistical or machine learning techniques for anomaly detection, correlation, or predictive insights

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD - 241,500 USD for Level 3, and 184,000 USD - 287,500 USD for Level 4.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until March 6, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Vacancy posted 4 days ago