Senior AI and HPC Observability Engineer
NVIDIA
$152k - $241.5k
NVIDIA is a pioneer in accelerated computing, known for inventing the GPU and driving breakthroughs in gaming, computer graphics, high-performance computing, and artificial intelligence. Our technology powers everything from generative AI to autonomous systems, and we continue to shape the future of computing through innovation and collaboration. Within this mission, our team, Managed AI Superclusters (MARS), builds and scales the infrastructure, platforms, and tools that enable researchers and engineers to develop the next generation of AI/ML systems. By joining us, you’ll help design solutions that power some of the world’s most advanced computing workloads.
Observability is at the heart of this transformation. We are looking for a strong AI & HPC Observability Engineer to build and scale next-generation Observability and Telemetry platforms. You will design and develop high-throughput, reliable telemetry pipelines and modern data infrastructure. This role requires solid distributed systems fundamentals, production-grade coding, and a passion for operational excellence.
What You Will Be Doing:
Design and scale observability platforms handling high-volume metrics, logs, and traces across distributed environments
Build high-performance backend services for telemetry ingestion, processing, and routing
Develop and extend OpenTelemetry collectors, processors, exporters, and instrumentation libraries
Build and optimize metrics pipelines using large-scale time-series storage systems
Design and operate real-time and batch telemetry pipelines using streaming and distributed data technologies
Improve platform reliability, performance, and cost efficiency through tuning, capacity planning, and system optimization
Develop monitoring, alerting, and service reliability frameworks to ensure platform health and performance
Collaborate with platform engineering, infrastructure, and site reliability teams to deliver production-grade observability solutions
What We Need to See:
Bachelor’s degree in Computer Science, Computer Engineering, or a related field, or equivalent experience
5+ years of experience building backend or distributed systems in production environments
Strong programming skills in Python, Go, or Java, with experience developing production-quality software
Hands-on experience with modern observability architectures, including metrics, logs, and traces
Solid experience with PromQL and time-series data systems
Experience building or operating distributed data pipelines using technologies such as Kafka, Spark, or Flink
Experience working with Kubernetes and cloud-native infrastructure
Strong understanding of distributed systems, concurrency, and fault-tolerant system design
Strong debugging, performance tuning, and production operations skills
Ways to Stand Out from the Crowd:
Proven experience designing and scaling observability platforms for AI, GPU, or HPC environments
Hands-on expertise with OpenTelemetry, Prometheus, Kafka, and high-volume distributed telemetry pipelines
Strong background in data engineering, time-series data modeling, and real-time performance tuning
Experience integrating observability with AI/ML pipelines, GPU workload monitoring, or intelligent alerting
Demonstrated use of statistical or machine learning techniques for anomaly detection, correlation, or predictive insights
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD - 241,500 USD for Level 3, and 184,000 USD - 287,500 USD for Level 4.
You will also be eligible for equity and benefits.
Applications for this job will be accepted at least until March 6, 2026.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

