HPC Systems Reliability Intern

$20 - $71 per hour

NVIDIA Gruppe

NVIDIA Gruppe is looking for an intern to investigate and triage failures within large-scale compute clusters. The role requires proficiency in Python and shell scripting, alongside strong debugging skills in complex systems. Interns will work closely with mentors, receive competitive hourly rates ranging from $20 to $71, and be eligible for benefits. Applications are accepted until May 31, 2026.

Vacancy posted 4 days ago
Similar jobs that could be interesting for you
Based on the HPC Systems Reliability Intern in Santa Clara, CA vacancy
  • $152k - $241.5k

     ...ll keep critically important systems running while working on the...  ...they integrate cleanly with HPC schedulers, storage, and network...  ...Quality of Service (QoS) for internal customers through operational...  ...lifecycle management, fleet reliability/auto‑healing, E2E observability... 
    Suggested

    NVIDIA Gruppe

    Santa Clara, CA
    5 days ago
  • $154.9k - $263.3k

     ...hands without us. KLA invents systems and solutions for the manufacturing...  .../Preferred Qualifications HPC server systems are...  ...and be cost effective, highly reliable and serviceable for 15+ years...  ...family care and bonding leave. Interns are eligible for some of the benefits... 
    Suggested
    Minimum wage
    Work experience placement
    Flexible hours

    KLA

    Milpitas, CA
    2 days ago
  • $95k - $161.5k

    KLA-Belgium is looking for a dedicated HPC Hardware Engineer in Milpitas, California. The role involves designing...  ...HPC clusters, validating hardware, and integrating systems across teams for optimal performance and reliability. The ideal candidate will have a Master's degree... 
    Suggested

    KLA-Belgium

    Milpitas, CA
    4 days ago
  • $110.5k - $152k

    Quality & Reliability Systems Engineer (E3). Locations: Santa Clara, CA. Time type: Full time. Posted on: Posted Today. Job requisition id:...  ...related knowledge, skills, experience, and with consideration of internal equity of our current team members. In addition to a... 
    Suggested
    Full time
    Relocation

    Applied Materials

    Santa Clara, CA
    5 days ago
  • $80k - $85k

     ...world-class Software Test Engineers (aka System Test engineers) to help us in building...  ...ensuring their consistent functionality and reliability. Your job is to find the hard...  ...skills. Compensation: The intern base pay for this role is $80,000 to $85... 
    Internship
    Night shift
    3 days per week

    Arista Networks, Inc.

    Santa Clara, CA
    5 days ago
  • $147.4k - $220.9k

    Site Reliability Engineer, Customer Systems Sunnyvale, California, United States Software and Services Imagine what you could do here. Apple is a place where extraordinary people gather to do their best work. Together we craft products and experiences people once couldn... 
    Relocation

    Apple Inc.

    Sunnyvale, CA
    3 days ago
  • $152k - $241.5k

     ...NVIDIA Gruppe in Santa Clara is seeking a Senior Software Engineer to enhance their HPC infrastructure. The role involves applying distributed systems patterns, automation, and building scalable services in a hybrid multi-cloud environment. Candidates should have strong... 

    NVIDIA Gruppe

    Santa Clara, CA
    5 days ago
  • $110.5k - $152k

    Applied Materials, Inc. in Santa Clara, CA is seeking a Quality & Reliability Systems Engineer (E3) to ensure product quality and reliability through testing and evaluation. This full-time position involves developing quality standards, implementing testing methods, and... 
    Full time

    Applied Materials, Inc.

    Santa Clara, CA
    2 days ago
  • $125k - $140k

     ...center design in innovative ways to deliver dramatic gains in reliability, efficiency and sustainability in flexible environments that can...  ...is responsible for the overall operating health of critical systems across Vantage global facilities. For each of the major systems... 
    Temporary work
    Work at office
    Local area
    Home office
    Flexible hours

    Vantage Data Centers

    Santa Clara, CA
    2 days ago
  •  ...Santa Clara is hiring for a role in their Hardware Infrastructure EDA Compute team to optimize workload scheduling systems and improve overall service reliability. The successful candidate will manage and scale job scheduling systems while driving measurable improvements in... 

    NVIDIA Gruppe

    Santa Clara, CA
    3 days ago
  • $20 - $71 per hour

     ...correlate specific job failures to system‑level issues and diagnostic...  ..., and reporting on key reliability metrics, specifically Mean Time...  ...high‑performance computing (HPC) environments, cluster managers...  .... The hourly rate for our interns is 20 USD - 71 USD. You will... 
    Internship
    Hourly pay

    NVIDIA Gruppe

    Santa Clara, CA
    14 hours ago
  • $208k - $253k

     ...strengthen Crusoe’s Hardware Systems Engineering team and close critical...  ..., deep issue resolution, and reliability across Crusoe Cloud’s GPU-...  ...manufacturing with both internal teams and external vendors....  ...and how to leverage them in AI/HPC environments. Expertise supporting... 
    Temporary work

    Crusoe

    Sunnyvale, CA
    5 days ago
  •  ...part of a core team that ensures safe, reliable, and scalable releases of the Autonomous...  ...collaborate closely with Release Engineers, Systems Engineers, DevOps, and AI/ML teams to...  ...compliance with relevant standards and internal governance. What You Must Have Strong... 
    Local area
    Work from home

    Israelvcforum

    Sunnyvale, CA
    2 days ago
  •  ...markets. As a Silicon Speed Features Engineer, you will co-design system-level speed features, build the validation and automation...  ...with system architects, hardware, firmware/software, process/reliability, and operations teams to co‑design system‑level speed features... 
    Full time

    NVIDIA AI

    Santa Clara, CA
    1 day ago
  • PlusAI is offering an exciting opportunity within the Systems Engineering team in Santa Clara, California. This role focuses on defining, tracking, and analyzing operational performance metrics related to Safety of the Intended Function. It requires ongoing improvement... 
    Internship

    PlusAI

    Santa Clara, CA
    3 days ago
  • $200k - $400k

     ...ultra-scale GPU supercomputing systems to train next-generation...  ...communication performance, distributed reliability, and cross-layer optimization...  ...with NCCL and/or UCX internals · Strong systems programming...  ...relevant distributed systems, HPC, or large-scale training... 
    Visa sponsorship

    Institute of Foundation Models

    Sunnyvale, CA
    1 day ago
  • $165k - $242k

     ...Systems Engineer, Kernel Livingston, NJ / New York, NY / Sunnyvale...  ...improves the performance and reliability of our stack. This...  ...containerd, nydus, kubelet) HPC/AI workloads (CUDA, GPUDirect...  ...Deep understanding of kernel internals (memory management, scheduling... 
    Permanent employment
    Temporary work
    Casual work
    Work at office
    Remote work
    Flexible hours

    CoreWeave

    Sunnyvale, CA
    2 days ago
  • $16 - $95 per hour

    Job Overview: This exciting opportunity within the Systems Engineering team at Plus will focus on continuously defining, tracking, and analyzing operational performance metrics to quantify Safety of the Intended Function (ISO 21448) Risk Acceptance Criteria (RAC). Responsibilities... 
    Internship

    PlusAI

    Santa Clara, CA
    3 days ago
  •  ...Ellis Technologies, Inc. is seeking a Site Reliability Engineer Intern for Summer 2026 in San Jose. Responsibilities include ensuring the reliability of major data warehouse products and optimizing performance through automation and incident management. Ideal candidates... 
    Internship
    Hourly pay
    Summer work

    Ellis Technologies, Inc.

    San Jose, CA
    5 days ago
  • $19 - $65 per hour

     ...Companies. Partners including TRATON GROUP’s Scania, MAN, and International brands, Hyundai Motor Company, Iveco Group, Bosch, and DSV are...  ...Required Qualifications Actively enrolled in a Master’s degree in Systems Engineering or related Engineering study Previous experience... 
    Internship
    Hourly pay

    Ring Inc

    Santa Clara, CA
    3 days ago
  • $159.2k - $301.6k

     ...Adobe is looking for a Staff Software Engineer - AI/ML Systems, MLOps & Reliability to help build and scale the platform powering Adobe Experience Platform's Personalization ML solutions and Generative AI capabilities. This role sits at the intersection of... 
    Temporary work
    Local area
    Worldwide

    Adobe

    San Jose, CA
    1 day ago
  • $136k - $218.5k

    NVIDIA in Santa Clara is seeking a Silicon Speed Features Engineer to co-design system-level speed features across Gaming, Datacenter, Automotive, and Embedded markets. The role involves collaborating cross-functionally and using AI to enhance automation tools for performance... 

    NVIDIA

    Santa Clara, CA
    14 hours ago
  • $126k - $148.32k

     ...turbines, and fuel cells to quickly and reliably deliver local power for EV charging, commercial...  ...S. today, and we’re quickly scaling for international expansion. Inspired by our vision of the...  ...plans with design engineers based on system and sub-system specs and possess the... 
    Local area
    Flexible hours

    Ring

    Menlo Park, CA
    4 days ago
  • $175k - $230k

     ...AI/HPC System Engineer. Job Title: AI/HPC System Engineer. Office Location: San Jose, CA. Job Type: Full-Time. Work Model: Onsite...  ...system software. Verbal/written communication with both internal and external collaborators. Qualifications: Ph.D. in... 
    Full time
    Work at office
    Local area

    SK hynix America Inc.

    San Jose, CA
    2 days ago
  • $66.4k - $99.6k

     ...Systems Test Engineering Intern Onto Innovation is a leader in process control, combining global scale with an expanded portfolio of leading...  ...most difficult yield, device performance, quality, and reliability issues. Onto Innovation strives to optimize customers' critical... 
    Internship
    Permanent employment

    Onto

    Milpitas, CA
    15 days ago
  •  ...leading firm in AI solutions is looking for an experienced AI Reliability Engineer (AI SRE) for a 12-month remote contract role. In this...  ...focus on ensuring the reliability and performance of critical AI systems by defining SLOs, implementing automated resilience measures,... 
    Contract work
    Remote work

    DeWinter Group

    Campbell, CA
    5 days ago
  • $188k - $275k

     ...building, and operating the company's core internal platforms and SaaS ecosystem. This team...  ..., identity, and edge infrastructure systems are secure, scalable, and well-governed....  ...scalable configurations, improving system reliability, and driving automation across the IT ecosystem... 
    Permanent employment
    Temporary work
    Casual work
    Work at office
    Flexible hours

    CoreWeave

    Sunnyvale, CA
    15 days ago
  •  ...Booster in Mountain View is looking for a Reliability and Risk Engineer to enhance the safety and reliability of autonomous trucking systems. The ideal candidate will have a Master's or PhD, along with over 5 years of related experience, and strong skills in statistical... 
    Work at office

    Booster

    Mountain View, CA
    5 days ago
  • $19 - $65 per hour

    Plus in Santa Clara is seeking a talented individual for an internship in Systems Engineering. You'll take ownership of deliverables, manage time and leverage AI tools effectively. Ideal candidates are pursuing a Master's degree in Systems Engineering with experience in... 
    Internship

    Plus

    Santa Clara, CA
    2 days ago
  • $155k - $263k

     ...hands without us. KLA invents systems and solutions for the manufacturing...  .../Preferred Qualifications HPC server systems are...  ...and be cost effective, highly reliable and serviceable for 15+ years...  ...family care and bonding leave. Interns are eligible for some of the benefits... 
    Remote job
    Minimum wage
    Work experience placement
    Flexible hours

    KLA

    Milpitas, CA
    more than 2 months ago
