HPC Systems Engineer
University of California, San Francisco
The CoreHPC team at UCSF is seeking an HPC Systems Engineer to play a key role in the development, maintenance, and day-to-day operations of the Institute's HPC clusters.
The HPC Systems Engineer will:
- Apply advanced systems infrastructure concepts and skills to the operations and improvement of large-scale and highly complex research Cyber Infrastructure (CI) with unique computing, networking, and storage systems designed to address cutting-edge research problems.
- Apply their engineering and design skills to develop new CI solutions and to develop and enhance monitoring that maintains the integrity of CI systems.
- Select methods, techniques and evaluation criteria to develop new CI solutions to address complex research problems.
- Be an active member of the support and maintenance efforts for the CoreHPC cluster, resolving user issues, fixing technical problems, resolving outages, patching, and maintaining systems' uptime and availability.
- Provide consultation, support, and guidance to researchers on how to address computational problems using standard tools, packages, and approaches.
- Develop enhancements of monitoring to maintain the integrity of CI systems.
- Participate in multiple technical projects simultaneously.
- Apply working knowledge of security control frameworks to maintain the integrity of the CI systems and the research being performed on them.
- Give presentations to the associated team and other technical units.
- Evaluate new technologies, including performing moderate to complex cost/benefit analyses.
This position may lead cross-functional technical working groups and projects in support of onboarding research customers or making systems improvements.
Department Overview
Academic Research Systems (ARS) serves the needs of the UCSF research community by providing an integrated repository of HIPAA-compliant clinical and life sciences data and a centralized, secure, professionally managed infrastructure for the storage and management of research data. ARS empowers medical scientific investigations by offering secure computing environments, data capture, management and analysis tools, and support services which meet researchers' needs.
The CoreHPC team of Academic Research Systems (ARS) focuses on large-scale, high-performance computational and storage services for UCSF researchers so they can address complex computational, AI, and data science problems.
Key Responsibilities (% of time)
- 15%: Applies advanced systems/infrastructure concepts to define, design, implement, and operate highly complex research cyberinfrastructure systems, services, and technology solutions. Proposes and implements highly complex system or device enhancements such as software, hardware, and network configuration, updates, and installations for projects or services of broad scope. Sets standards for monitoring and maintaining the health and integrity of CI systems, including upgrading and patching.
- 15%: Independently manages systems and services for a large facility, campuswide, medical center or Office of the President and/or institution-wide scope and makes recommendations for purchases or upgrades. Performs complex and advanced analysis to acquire, install, modify, and support operating systems, databases, utilities, and web-related tools. Selects methods and techniques to obtain solutions. Interacts with senior management. May perform complex network integration tasks and interoperability assessments for interconnected servers or components of clusters for communication. Supports and collaborates with researchers and other key IT (e.g. network and security) and Data Center partners in a timely manner.
- 15%: Specifies, writes, and executes highly complex software and scripts to support systems management, log analysis, monitoring, deployment, configuration management, and other system administration duties for multiple, highly integrated systems.
- 30%: Provides consultation, training, support, and guidance to researchers, enabling them to utilize HPC resources effectively.
- 10%: Maintains complex security systems. Interprets and adopts campus, medical center or Office of the President, system and regulation-based security policies to control access to networked resources. Provides recommendations and requirements on network access controls.
- 5%: Collaborates with, and may provide leadership to, other Systems Engineers within the CI ecosystem/higher-education community. Regularly contributes best-practices documentation, presents at conferences, or publishes in peer-reviewed journals.
- 10%: Defines and tracks performance metrics to ensure efficient current and future use of cyberinfrastructure resources.
- Total: 100%
REQUIRED QUALIFICATIONS
- Bachelor's degree in a related area such as computer science or engineering, and 6+ years of experience with large-scale or HPC systems, or 10+ years of related experience with large-scale or HPC systems
- Expert knowledge of HPC systems infrastructure design
- Strong knowledge of high-performance parallel filesystems and storage such as GPFS, Lustre, Vast, DDN, etc.
- Advanced knowledge of computer security best practices and policies, including demonstrated experience securing research cyberinfrastructure systems to meet NIST 800-171 / 800-223, HIPAA, or IS-3 requirements
- Demonstrated testing and test planning skills. Demonstrated ability to create automated testing.
- Knowledge of HPC job scheduler system design and operation, such as SLURM or PBS
- Demonstrated skill (5+ years) deploying, managing, and troubleshooting Warewulf (or similar) InfiniBand-based clusters
- Ability to elicit and communicate technical and non-technical information in a clear and concise manner.
- Self-motivated and works independently and as part of a team. Demonstrates problem-solving skills. Able to learn effectively and meet deadlines.
- Understanding of system performance monitoring and actions that can be taken to improve or correct performance.
- Demonstrated advanced knowledge, skills and abilities associated with system problem identification and resolution. Experience with design, configuration, operation, repair, and tuning of technology systems.
- Advanced experience writing and editing the most complex scripts used to perform system maintenance and administration.
- Ability to write technical documentation in a clear and concise manner. Ability to develop runbooks defining complex technical processes in a clear and concise manner.
PREFERRED QUALIFICATIONS
- Knowledge of the design, development, and application of technology and systems to meet business needs.
- General knowledge of other areas of IT. Thorough understanding of and experience with systems-related issues and actions that can be taken to improve or correct performance.
- Demonstrated skills associated with adapting equipment and technology to serve user needs. Demonstrated comprehensive understanding of how system management actions affect other systems, system users and dependent/related functions.

