HPC Infrastructure DevOps Engineer II

$86.32k - $154.96k

St. Jude Children's Research Hospital

Memphis, TN

Position Overview

St. Jude is seeking an HPC Infrastructure DevOps Engineer II to join the High‑Performance Computing Support (HPCS) team. This role is responsible for the smooth operation, automation, and continuous improvement of St. Jude’s high‑performance computing environment, with a focus on HPC operations, DevOps practices, and automation for configuration, testing, monitoring, and autonomous remediation.

- HPC compute platforms for research and data‑intensive workloads
- GPU‑enabled environments for AI and machine learning applications
- High‑capacity research, compliant, and scratch storage tiers
- Archival, backup, and disaster recovery services
- Operational tooling for observability, governance, and process automation

Working closely with infrastructure, storage, security, and research teams, the HPC Infrastructure DevOps Engineer II will deliver reliable and scalable services for computational science, regulated workflows, and AI‑enabled research. This role is central to the HPCS service portfolio, including daily HPC client request fulfillment, performance and utilization monitoring, data management and governance, data cataloguing and archival services, and HPC process automation DevOps.

Job Responsibilities

HPC Infrastructure Operations

- Support the day‑to‑day operation of St. Jude’s HPC infrastructure across compute and storage platforms.
- Maintain a stable, secure, and scalable environment for research computing and data‑intensive scientific workflows.
- Work with downstream operational teams to ensure systems are configured, validated, monitored, patched, and maintained effectively.
- Participate in infrastructure testing, upgrade activities, service transitions, and operational readiness efforts.
- Contribute to the reliability and supportability of hybrid HPC environments spanning primary and remote‑site services.

Daily HPC Client Request Fulfillment

- Respond to daily user requests involving HPC access, Linux environment support, storage allocation, software availability, job troubleshooting, and data movement.
- Provide timely and effective support to researchers, analysts, and technical staff using HPC and AI‑enabled research resources.
- Resolve service incidents and user issues through structured troubleshooting and escalation as needed.
- Maintain service‑oriented communication with users and stakeholders to support a high‑quality support experience.

Performance and Utilization Monitoring

- Implement and improve monitoring for compute nodes, GPU resources, scheduler activity, storage systems, backup operations, and platform health.
- Track usage trends, availability, capacity consumption, and operational KPIs to support efficient service delivery.
- Analyze utilization patterns and recommend improvements to throughput, performance tuning, scheduling efficiency, and user experience.
- Build and maintain dashboards, metrics collection workflows, health checks, and alerting mechanisms to support proactive operations and continuous process improvement.
- Support governance reporting and visibility into service consumption and infrastructure health.

Data Management and Governance

- Support operational controls for research and compliant data across active storage, protected environments, backup systems, and archival tiers.
- Implement and maintain standards for data handling, retention, access control, traceability, and lifecycle operations.
- Contribute to governance tracking and reporting for HPC‑supported data services.
- Assist with data movement and retention workflows across high‑performance, compliant, backup, and archival storage platforms.

Data Cataloguing and Archival Services

- Support data intake, metadata‑aware cataloguing, archival placement, recall, restore validation, and tier‑to‑tier data movement.
- Assist with workflows involving archival platforms, cold storage, backup systems, and long‑term retention services.
- Improve discoverability and lifecycle management of research datasets through automation and procedural standardization.
- Support operational validation of archival and recovery workflows for critical data services.

HPC Process Automation DevOps

- Use automation tooling to handle system configuration, provisioning, platform maintenance, testing, and operational workflows.
- Enable DevOps lifecycle functions by supporting tooling and processes for development, testing, release, and operational support.
- Build and maintain CI/CD pipelines and repeatable infrastructure workflows to improve reliability, consistency, and deployment speed.
- Reduce manual effort by developing scripts, integrations, and self‑service mechanisms for recurring HPCS operational tasks.
- Apply automation and generative AI tools responsibly to improve scripting, documentation, incident analysis, and support efficiency.

AI and Accelerated Computing Support

- Assist with deployment and support of AI software stacks, containerized research environments, and Python‑based computational workflows.
- Help optimize GPU utilization, data throughput, storage access patterns, and job execution for AI training and inference use cases.
- Support reproducible environments for AI applications through dependency management and platform validation.
- Maintain GPU software stacks, containerized runtimes, dependency consistency, and high‑throughput data access for distributed AI training and inference.
- Contribute to operational support for research teams using container runtimes, distributed job workflows, and accelerator‑aware scheduling.

Security, Risk, and Incident Response

- Identify and deploy security measures through vulnerability assessment, configuration review, patch coordination, and risk‑aware operational practices.
- Participate in incident response, troubleshooting, and root cause analysis for infrastructure and service disruptions.
- Support backup validation, restore readiness, and disaster recovery operational practices for critical HPC services.
- Follow institutional requirements for secure handling of research and regulated data.

Collaboration and Continuous Improvement

- Work with end users and partner teams to understand operational needs, user requirements, and service KPIs.
- Coordinate across technical teams to improve service quality, communication, and execution.
- Contribute to ongoing process improvement initiatives that reduce lead time, strengthen platform reliability, and improve user experience.
- Maintain accurate technical documentation for systems, configurations, procedures, and knowledge articles.
- Perform other duties as assigned to support the goals and objectives of the department and institution.

Minimum Education

- Bachelor's degree in Computer Science, Engineering, Business, or related field of study required.
- Master's degree preferred.

Minimum Experience

- Minimum requirement: Two (2) years of IT experience, including experience in infrastructure operations and engineering environments.
- Some experience in infrastructure design, systems analysis, and security management.
- Some experience working with business stakeholders to identify and document requirements.
- Proven performance in an earlier or comparable role.

Preferred Qualifications

Infrastructure & Operations Experience

- Supporting Linux‑based enterprise or research computing environments.
- Experience with scripting or automation using Python, Bash, or similar languages.
- Familiarity with DevOps and infrastructure‑as‑code practices using tools such as Ansible, Terraform, Git‑based workflows, or CI/CD platforms.
- Experience with observability platforms for logs, metrics, dashboards, and alerting.
- Familiarity with HPC workload schedulers such as IBM LSF, Slurm, or comparable systems.
- Experience supporting high‑performance storage, backup, and archival services.
- Familiarity with containers and reproducible compute environments such as Singularity, Docker, or related platforms.
- Understanding of secure multi‑user platform operations in research or regulated environments.

AI / ML Environment Experience

- Supporting GPU‑based systems for AI and machine learning workloads.
- Familiarity with CUDA‑enabled environments, GPU monitoring, and NVIDIA software stack dependencies (CUDA, cuDNN, NCCL).
- Experience with AI/ML ecosystems such as PyTorch, TensorFlow, Jupyter, and distributed training workflows.
- Familiarity with distributed and multi‑GPU training frameworks such as PyTorch Distributed, DeepSpeed, Horovod, or Ray.
- Understanding of data pipeline, storage throughput, checkpointing, and large dataset staging requirements for model training and inference.
- Familiarity with operational practices adjacent to MLOps, including experiment tracking, artifact handling, workflow automation, and workload observability.
- Understanding of secure and compliant support for AI workloads operating on sensitive research data.
- Ability to troubleshoot AI application issues across infrastructure, scheduler, storage, container, and accelerator layers.

Generative AI Productivity

- Familiarity with generative AI tools that improve productivity in infrastructure operations, DevOps, and technical support workflows.
- Ability to use generative AI assistants to accelerate scripting, automation, troubleshooting, and documentation tasks.
- Familiarity with prompt design and iterative prompting techniques for script development, log analysis, workflow generation, and systems diagnostics.
- Understanding of the limitations, risks, and verification requirements of generative AI outputs, including accuracy validation, security awareness, and protection of sensitive or regulated data.
- Ability to identify practical use cases for generative AI that reduce manual effort across HPC support, governance tracking, and process automation.

Core Competencies

- Strong troubleshooting and analytical skills.
- Service‑oriented mindset with excellent communication and follow‑through.
- Ability to work effectively across infrastructure, operations, and user‑facing support functions.
- Commitment to documentation, process discipline, and continuous improvement.
- Comfort operating in a mission‑driven research and regulated data environment.
- Ability to balance technical depth with responsiveness in a fast‑paced, high‑stakes setting.

Compensation

In recognition of certain U.S. state and municipal pay transparency laws, St. Jude is including a reasonable estimate of the compensation range for this role. St. Jude has provided a salary range of $86,320 - $154,960 per year for the role of HPC Infrastructure DevOps Engineer II.

Equal Opportunity Employer

St. Jude Children's Research Hospital is an Equal Opportunity Employer. St. Jude does not accept unsolicited assistance from search firms for employment opportunities. All resumes submitted by search firms to any employee or other representative at St. Jude via email, the internet or in any form and/or method without a valid written search agreement in place and approved by HR will result in no fee being paid in the event the candidate is hired by St. Jude.

Vacancy posted 6 hours ago

