HPC Systems Reliability Intern
$20 - $71 per hour
NVIDIA Gruppe
NVIDIA Gruppe is looking for an intern to investigate and triage failures within large-scale compute clusters. The role requires proficiency in Python and shell scripting, alongside strong debugging skills in complex systems. Interns will work closely with mentors, receive competitive hourly rates ranging from $20 to $71, and be eligible for benefits, with applications accepted until May 31, 2026.
$152k - $241.5k
...ll keep critically important systems running while working on the... ...they integrate cleanly with HPC schedulers, storage, and network... ...Quality of Service (QoS) for internal customers through operational... ...lifecycle management, fleet reliability/auto‑healing, E2E observability...
Suggested
$154.9k - $263.3k
...hands without us. KLA invents systems and solutions for the manufacturing... .../Preferred Qualifications HPC server systems are... ...and be cost effective, highly reliable and serviceable for 15+ years... ...family care and bonding leave. Interns are eligible for some of the benefits...
Suggested, Minimum wage, Work experience placement, Flexible hours
$95k - $161.5k
KLA-Belgium is looking for a dedicated HPC Hardware Engineer in Milpitas, California. The role involves designing... ...HPC clusters, validating hardware, and integrating systems across teams for optimal performance and reliability. The ideal candidate will have a Master's degree...
Suggested
$110.5k - $152k
Quality & Reliability Systems Engineer - (E3). Locations: Santa Clara, CA. Time type: Full time. Posted on: Posted Today. Job requisition id:... ...related knowledge, skills, experience, and with consideration of internal equity of our current team members. In addition to a...
Suggested, Full time, Relocation
$80k - $85k
...world-class Software Test Engineers (aka System Test engineers) to help us in building... ...ensuring their consistent functionality and reliability. Your job is to find the hard... ...skills. Compensation: The intern base pay for this role is $80,000 to $85...
Internship, Night shift, 3 days per week
$147.4k - $220.9k
Site Reliability Engineer, Customer Systems. Sunnyvale, California, United States. Software and Services. Imagine what you could do here. Apple is a place where extraordinary people gather to do their best work. Together we craft products and experiences people once couldn...
Relocation
$152k - $241.5k
...NVIDIA Gruppe in Santa Clara is seeking a Senior Software Engineer to enhance their HPC infrastructure. The role involves applying distributed systems patterns, automation, and building scalable services in a hybrid multi-cloud environment. Candidates should have strong...
$110.5k - $152k
Applied Materials, Inc. in Santa Clara, CA is seeking a Quality & Reliability Systems Engineer (E3) to ensure product quality and reliability through testing and evaluation. This full-time position involves developing quality standards, implementing testing methods, and...
Full time
$125k - $140k
...center design in innovative ways to deliver dramatic gains in reliability, efficiency and sustainability in flexible environments that can... ...is responsible for the overall operating health of critical systems across Vantage global facilities. For each of the major systems...
Temporary work, Work at office, Local area, Home office, Flexible hours
- ...Santa Clara is hiring for a role in their Hardware Infrastructure EDA Compute team to optimize workload scheduling systems and improve overall service reliability. The successful candidate will manage and scale job scheduling systems while driving measurable improvements in...
$20 - $71 per hour
...correlate specific job failures to system‑level issues and diagnostic... ..., and reporting on key reliability metrics, specifically Mean Time... ...high‑performance computing (HPC) environments, cluster managers... .... The hourly rate for our interns is 20 USD - 71 USD. You will...
Internship, Hourly pay
$208k - $253k
...strengthen Crusoe’s Hardware Systems Engineering team and close critical... ..., deep issue resolution, and reliability across Crusoe Cloud’s GPU-... ...manufacturing with both internal teams and external vendors.... ...and how to leverage them in AI/HPC environments. Expertise supporting...
Temporary work
- ...part of a core team that ensures safe, reliable, and scalable releases of the Autonomous... ...collaborate closely with Release Engineers, Systems Engineers, DevOps, and AI/ML teams to... ...compliance with relevant standards and internal governance. What You Must Have Strong...
Local area, Work from home
- ...markets. As a Silicon Speed Features Engineer, you will co-design system-level speed features, build the validation and automation... ...with system architects, hardware, firmware/software, process/reliability, and operations teams to co‑design system‑level speed features...
Full time
- PlusAI is offering an exciting opportunity within the Systems Engineering team in Santa Clara, California. This role focuses on defining, tracking, and analyzing operational performance metrics related to Safety of the Intended Function. It requires ongoing improvement...
Internship
$200k - $400k
...ultra-scale GPU supercomputing systems to train next-generation... ...communication performance, distributed reliability, and cross-layer optimization... ...with NCCL and/or UCX internals · Strong systems programming... ...relevant distributed systems, HPC, or large-scale training...
Visa sponsorship
$165k - $242k
...Systems Engineer, Kernel Livingston, NJ / New York, NY / Sunnyvale... ...improves the performance and reliability of our stack. This... ...containerd, nydus, kubelet) HPC/AI workloads (CUDA, GPUDirect... ...Deep understanding of kernel internals (memory management, scheduling...
Permanent employment, Temporary work, Casual work, Work at office, Remote work, Flexible hours
$16 - $95 per hour
Job Overview: This exciting opportunity within the Systems Engineering team at Plus will focus on continuously defining, tracking, and analyzing operational performance metrics to quantify Safety of the Intended Function (ISO 21448) Risk Acceptance Criteria (RAC). Responsibilities...
Internship
- ...Ellis Technologies, Inc. is seeking a Site Reliability Engineer Intern for Summer 2026 in San Jose. Responsibilities include ensuring the reliability of major data warehouse products and optimizing performance through automation and incident management. Ideal candidates...
Internship, Hourly pay, Summer work
$19 - $65 per hour
...Companies. Partners including TRATON GROUP’s Scania, MAN, and International brands, Hyundai Motor Company, Iveco Group, Bosch, and DSV are... ...Required Qualifications Actively enrolled in a Master’s degree in Systems Engineering or related Engineering study Previous experience...
Internship, Hourly pay
$159.2k - $301.6k
...Adobe is looking for a Staff Software Engineer - AI/ML Systems, MLOps & Reliability to help build and scale the platform powering Adobe Experience Platform's Personalization ML solutions and Generative AI capabilities. This role sits at the intersection of...
Temporary work, Local area, Worldwide
$136k - $218.5k
NVIDIA in Santa Clara is seeking a Silicon Speed Features Engineer to co-design system-level speed features across Gaming, Datacenter, Automotive, and Embedded markets. The role involves collaborating cross-functionally and using AI to enhance automation tools for performance...
$126k - $148.32k
...turbines, and fuel cells to quickly and reliably deliver local power for EV charging, commercial... ...S. today, and we’re quickly scaling for international expansion. Inspired by our vision of the... ...plans with design engineers based on system and sub-system specs and possess the...
Local area, Flexible hours
$175k - $230k
...AI/HPC System Engineer. Job Title: AI/HPC System Engineer. Office Location: San Jose, CA. Job Type: Full-Time. Work Model: Onsite... ...system software. Verbal/written communication with both internal and external collaborators. Qualifications: Ph.D. in...
Full time, Work at office, Local area
$66.4k - $99.6k
...Systems Test Engineering Intern. Onto Innovation is a leader in process control, combining global scale with an expanded portfolio of leading... ...most difficult yield, device performance, quality, and reliability issues. Onto Innovation strives to optimize customers' critical...
Internship, Permanent employment
- ...leading firm in AI solutions is looking for an experienced AI Reliability Engineer (AI SRE) for a 12-month remote contract role. In this... ...focus on ensuring the reliability and performance of critical AI systems by defining SLOs, implementing automated resilience measures,...
Contract work, Remote work
$188k - $275k
...building, and operating the company's core internal platforms and SaaS ecosystem. This team... ..., identity, and edge infrastructure systems are secure, scalable, and well-governed.... ...scalable configurations, improving system reliability, and driving automation across the IT ecosystem...
Permanent employment, Temporary work, Casual work, Work at office, Flexible hours
- ...Booster in Mountain View is looking for a Reliability and Risk Engineer to enhance the safety and reliability of autonomous trucking systems. The ideal candidate will have a Master's or PhD, along with over 5 years of related experience, and strong skills in statistical...
Work at office
$19 - $65 per hour
Plus in Santa Clara is seeking a talented individual for an internship in Systems Engineering. You'll take ownership of deliverables, manage time and leverage AI tools effectively. Ideal candidates are pursuing a Master's degree in Systems Engineering with experience in...
Internship
$155k - $263k
...hands without us. KLA invents systems and solutions for the manufacturing... .../Preferred Qualifications HPC server systems are... ...and be cost effective, highly reliable and serviceable for 15+ years... ...family care and bonding leave. Interns are eligible for some of the benefits...
Remote job, Minimum wage, Work experience placement, Flexible hours