HPC Engineer

$150k - $300k

Institute of Foundation Models

About MBZUAI
The Institute for Foundation Models (IFM) operates some of the world's largest AI supercomputing environments.

Position Summary
This role provides operational coverage during Abu Dhabi overnight hours and serves as a primary point of contact for infrastructure monitoring, incident triage, researcher support, and production operations.

Responsibilities
  • Monitor health, performance, and availability of large-scale GPU clusters.
  • Respond to incidents and perform first-level triage.
  • Support researchers and troubleshoot job failures.
  • Execute operational runbooks and recovery procedures.
  • Validate cluster deployments, upgrades, and maintenance activities.
  • Track infrastructure utilization and operational metrics.
  • Develop automation and monitoring tools.
  • Contribute to documentation and reporting.

Education
Bachelor's degree in Computer Science, Computer Engineering, Software Engineering, Information Technology, Electrical Engineering, Mathematics, Physics, or related disciplines.

Experience
  • 2+ years in Linux systems administration, SRE, DevOps, cloud operations, HPC, or infrastructure operations.
  • Strong Linux troubleshooting skills.
  • Experience with scripting using Python or Bash.

Preferred Qualifications
  • Slurm
  • GPU infrastructure
  • AWS, Azure, or GCP
  • Grafana, Prometheus, Datadog, or similar tools
  • Containers and Kubernetes
  • AI/ML infrastructure exposure
  • Research computing environments

Salary Range
$150,000 - $300,000 a year

The posted salary range represents the company’s good faith estimate of the compensation for this position upon hire. The actual compensation offered may vary within this range depending on individual qualifications, including but not limited to relevant skills, experience, education, certifications, geographic location, and specific business needs.

Benefits Include
  • Comprehensive medical, dental, and vision benefits
  • Bonus
  • 401K Plan
  • Generous paid time off, sick leave and holidays
  • Paid Parental Leave
  • Employee Assistance Program
  • Life insurance and disability

Vacancy posted 2 days ago
Similar jobs that could be interesting for you
Based on the HPC Engineer in Sunnyvale, CA vacancy
  • $124k - $195.5k

    NVIDIA Gruppe in Santa Clara seeks an HPC Operations Engineer to design and implement compute clusters for silicon development. Ideal candidates will have experience troubleshooting in large-scale environments and enhancing deployment automation. Applicants should be proficient... 
    Suggested

    NVIDIA Gruppe

    Santa Clara, CA
    1 day ago
  • $152k - $241.5k

    NVIDIA Gruppe in Santa Clara is seeking a Senior Software Engineer to enhance their HPC infrastructure. The role involves applying distributed systems patterns, automation, and building scalable services in a hybrid multi-cloud environment. Candidates should have strong... 
    Suggested

    NVIDIA Gruppe

    Santa Clara, CA
    9 hours ago
  • $124k - $195.5k

     ...clusters at high reliability, efficiency, and performance, and drive foundational improvements and automation to improve engineers’ productivity. As an HPC Operations Engineer, you are responsible for the big picture of how our systems relate to each other, using a breadth... 
    Suggested

    NVIDIA Gruppe

    Santa Clara, CA
    1 day ago
  •  ...environment remains resilient, measurable, and aligned with long‑term engineering demands. What you’ll be doing: Manage, scale, and optimize job...  ...and tuning job scheduling systems (LSF, Slurm, etc.) in HPC or silicon design environments Proficiency in Linux systems administration... 
    Suggested

    NVIDIA Gruppe

    Santa Clara, CA
    1 day ago
  • $152k - $241.5k

    Overview NVIDIA is looking for a Senior Site Reliability Engineer (SRE) to join our Compute Farm team and help build the next generation...  ...and continuous improvement, ensuring they integrate cleanly with HPC schedulers, storage, and network fabrics. Use IaC (Infrastructure... 
    Suggested

    NVIDIA Corporation

    Santa Clara, CA
    1 day ago
  • $152k - $241.5k

     ...and continuous improvement, ensuring they integrate cleanly with HPC schedulers, storage, and network fabrics. Use IaC (...  ...programming languages such as Python, Go, Perl, or Ruby. Mentored other engineers and influenced technical direction through design reviews, architecture... 

    NVIDIA Gruppe

    Santa Clara, CA
    1 day ago
  • NVIDIA Gruppe is looking for a senior engineer to join their Math Libraries team in Santa Clara, California. This role involves designing...  ...generation. The ideal candidate has over 8 years of experience in HPC software development using C++, along with leadership skills and... 

    NVIDIA Gruppe

    Santa Clara, CA
    1 day ago
  • NVIDIA is hiring a highly skilled and experienced HPC Cluster Engineer to design, deploy and operate GPU Compute Clusters for EDA and high‑performance computing workloads across multiple teams and projects. What you’ll be doing: Develop and enhance the ecosystem around... 

    NVIDIA Corporation

    Santa Clara, CA
    1 day ago
  • $159.5k - $271.2k

     ...invest 15% of sales back into R&D. Our expert teams of physicists, engineers, data scientists and problem-solvers work together with the...  ...deployment, and operational support of a high-performance computing (HPC) cluster platform used across IC fabrication facilities and mask... 
    Minimum wage
    Work experience placement
    Flexible hours

    KLA

    Milpitas, CA
    9 hours ago
  • Schlumberger is seeking a High Performance Computing (HPC) Engineer in Sunnyvale, CA, to tackle complex discrete optimization problems. The ideal candidate will have a strong background in operations research and advanced mathematical programming, along with hands-on experience... 

    Schlumberger

    Sunnyvale, CA
    3 days ago
  • A leading computer technology company in California seeks an HPC Operations Engineer to provide leadership in designing and implementing compute clusters. You will troubleshoot issues, enhance automation, and collaborate across diverse teams to improve systems. The role... 

    NVIDIA Corporation

    Santa Clara, CA
    3 days ago
  • $184k - $356.5k

    Senior Math Libraries Engineer - AI and HPC. Locations: US, CA, Santa Clara; US, PA, Remote; US, WA, Remote; US, CA, Remote; US, MA, Remote. Time type: Full time. Posted on: Posted Today. Job requisition ID: JR1998721... 
    Remote work

    NVIDIA Corporation

    Santa Clara, CA
    4 days ago
  • Onesubsea is looking for a highly skilled High Performance Computing (HPC) Engineer in Sunnyvale, California. The ideal candidate will have deep expertise in operations research and quantum computing technologies, focusing on solving complex optimization problems. This... 

    Onesubsea

    Sunnyvale, CA
    3 days ago
  • $184k - $287.5k

     ...implement scalable, next‑gen distributed storage services for HPC workloads, optimizing both performance and cost‑effectiveness to...  ...need to see Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience. 8+ years of experience... 

    NVIDIA Gruppe

    Santa Clara, CA
    1 day ago
  • NVIDIA Gruppe in Santa Clara, California is seeking a skilled HPC/AI Benchmarking and Telemetry Engineer to join their team. In this role, you will develop benchmarking approaches for large-scale HPC and AI clusters, create telemetry frameworks to capture performance data... 

    NVIDIA Gruppe

    Santa Clara, CA
    1 day ago
  •  ...combines large-scale distributed systems, cloud platforms, and HPC environments to support cutting‑edge research and production workloads...  ...our culture on About the Role We are looking for Systems Engineers / System Administrators to help design, operate, and scale the infrastructure... 

    Mistral

    Palo Alto, CA
    1 day ago
  • A leading cloud technology company is seeking a highly skilled HPC Performance Engineer to join their HAVOCK Team in Sunnyvale, California. In this role, you will optimize bare-metal systems and ensure the performance of complex workloads using various technologies including... 

    CoreWeave

    Sunnyvale, CA
    2 days ago
  • $165k - $242k

     ...company (Nasdaq: CRWV) in March 2025. Learn more at What You’ll Do CoreWeave is seeking a highly skilled and motivated HPC Performance Engineer to join our HAVOCK Team, reporting into the Manager of Systems Engineering. In this role, you will play a crucial part in... 
    Permanent employment
    Temporary work
    Casual work
    Work at office
    Remote work
    Flexible hours

    CoreWeave

    Sunnyvale, CA
    2 days ago
  • $165k - $220k

     ...CX organization aligns closely with the internal and customer engineering teams, offering valuable insights from the field and having the...  ...focusing on AI/ML workloads within high-performance compute (HPC) environments Collaborate closely with customers to understand... 
    Permanent employment
    Temporary work
    Casual work
    Work at office
    Flexible hours

    CoreWeave

    Sunnyvale, CA
    5 days ago
  • $95k - $161.5k

    KLA-Belgium in Milpitas is seeking a talented individual to design and configure high-performance computing (HPC) clusters. The role involves selecting and validating hardware components, integrating systems, and collaborating with cross-functional teams for successful... 

    KLA-Belgium

    Milpitas, CA
    1 day ago
  • $95k - $161.5k

     ...invest 15% of sales back into R&D. Our expert teams of physicists, engineers, data scientists and problem‑solvers work together with the...  ...Europe and North America. Key Responsibilities Design & configure HPC clusters - Support development of compute cluster architectures... 
    Work experience placement
    Internship
    Worldwide
    Flexible hours

    KLA-Belgium

    Milpitas, CA
    2 days ago
  • $163k - $237k

    Google Inc. in Sunnyvale, CA is seeking an Integrated Circuit Package Design Engineer to develop package designs for machine learning high-performance computers. You will coordinate with thermal, mechanical, and chip design teams to produce cutting-edge hardware solutions... 

    Google Inc.

    Sunnyvale, CA
    4 days ago
  • $163k - $237k

     ...Sunnyvale, CA. The role involves developing advanced package substrate designs for Machine Learning chips and collaborating with engineering teams to enhance product performance and efficiency. The ideal candidate will have a Bachelor's degree in a relevant field and extensive... 

    Google Inc.

    Sunnyvale, CA
    4 days ago
  • Mistral in Palo Alto seeks Systems Engineers/System Administrators to design, operate, and scale the infrastructure behind AI platforms. This...  ...requires strong Linux administration skills and experience with HPC or cloud environments. Join a team focused on performance... 

    Mistral

    Palo Alto, CA
    2 days ago
  • MatX Inc. in Mountain View, CA, is seeking a Mechanical and Thermal Reliability Engineer to develop and execute reliability strategies for advanced infrastructure systems. The role involves conducting failure analysis, improving liquid cooling systems' reliability, and... 

    MatX Inc.

    Mountain View, CA
    9 hours ago
  • $136k - $218.5k

    NVIDIA AI is seeking a Hardware Applications Engineer in Santa Clara, California. The ideal candidate will focus on customer enablement of enterprise products in a datacenter environment, creating technical documentation and engaging with customers to resolve issues. This... 

    NVIDIA AI

    Santa Clara, CA
    2 days ago
  • $92k - $155.25k

    NVIDIA Gruppe is seeking an AI Compute Engineer to enhance AI Compute systems within the Infrastructure Specialists team in Santa Clara, California. You will manage and validate Linux-based customer infrastructures while interacting with stakeholders to implement extensive... 

    NVIDIA Gruppe

    Santa Clara, CA
    1 day ago
  • $90k - $110k

     ...CRWV) in March 2025. Learn more at About the Role At CoreWeave we are seeking a dedicated and detail-oriented Operations Engineer to join our HPC Networking Team. HPC Networking at CoreWeave is tasked with developing and operating some of the largest InfiniBand... 
    Permanent employment
    Temporary work
    Casual work
    Work at office
    Flexible hours

    CoreWeave

    Sunnyvale, CA
    5 days ago
  • $163k - $237k

    Google is seeking a Chip Package Signal Integrity/Power Integrity Engineer to join the ML, Systems, & Cloud AI organization. You will contribute to developing custom silicon solutions that power the future of Google's products. Your responsibilities include managing chip... 

    Google

    Sunnyvale, CA
    9 hours ago
  • Dell is seeking a Senior Director, CTIO Engineering Technologists to lead a team focused on creating software and hardware IT solutions for large data center customers. In this pivotal role, you'll drive the development of AI Systems and engage cross-functionally to ensure... 

    Dell

    Santa Clara, CA
    1 day ago
