
HPC Data Center Operational Lead

P2P

HPC Infrastructure Operations Lead

Location: Chicago or New York (On‑site 5 days/week; regular travel to HPC data center sites required)

Jump's HPC infrastructure powers some of the most demanding computational workloads in the industry. As our HPC footprint grows, we need a seasoned operations leader to own the reliability, standards, and day‑to‑day excellence of these environments.

What You'll Do:

Team Leadership & Organizational Ownership
  • Lead and manage data center site leads and their teams across multiple HPC facilities; site leads report directly to this role.
  • Recruit, mentor, and develop team members while conducting performance reviews and building a culture of operational rigor.
  • Direct onsite contractors by providing clear scope and validating completed work.

HPC Data Center Standards, Processes & Preventative Maintenance
  • Develop, document, and enforce operational standards and procedures for Jump's HPC data centers covering power, cooling, cabling, and hardware lifecycle.
  • Design and own the preventative maintenance program, including scheduled inspections, component replacements, and firmware/capacity reviews to minimize unplanned downtime.
  • Drive continuous improvement of operational processes and pursue automation, including AI‑driven approaches, to reduce manual effort and human error.

Critical Facility Systems Expertise
  • Serve as the subject matter authority on HPC data center power distribution, power striping strategies, and failover/redundancy configurations.
  • Own expertise across air cooling, liquid cooling (direct‑to‑chip, rear‑door, CDU‑based), and hybrid cooling architectures.
  • Maintain deep knowledge of environmental monitoring and controls (temperature, humidity, airflow, leak detection) and ensure systems remain within design parameters.

Monitoring & Incident Response
  • Own the HPC data center monitoring strategy end‑to‑end: define what is monitored, set alerting thresholds, and ensure comprehensive visibility into facility and hardware health.
  • Leverage AI tools to analyze telemetry data, identify failure patterns, predict potential issues, and accelerate root cause analysis during incidents.
  • Lead critical incident response and drive root cause analysis and corrective actions to prevent recurrence.
  • Establish and track operational KPIs including availability, mean time to repair, and efficiency metrics.

Server & Switch Hardware Expertise
  • Maintain deep, hands‑on knowledge of server hardware architectures including multi‑socket platforms, GPU/accelerator configurations, memory subsystems, NVMe/storage controllers, BMC/IPMI management, and firmware lifecycle.
  • Maintain deep, hands‑on knowledge of network switch hardware including line cards, optics/transceivers, switch fabrics, and platform‑specific diagnostics for Arista and Cisco platforms.
  • Evaluate new hardware platforms, drive hardware qualification and acceptance testing, and provide informed recommendations on hardware selection.

Hardware Break‑Fix
  • Own the overall hardware break‑fix function across all HPC sites, ensuring rapid diagnosis and resolution for servers, GPUs, network equipment, storage, and facility infrastructure.
  • Diagnose complex hardware failures at the component level (CPUs, DIMMs, GPUs, NICs, PSUs, fans, drives, switch line cards, and optics) and direct the team to resolve efficiently.
  • Establish escalation paths, SLA targets, and reporting for hardware failures.

Inventory & Spares Management
  • Own inventory processes and spares tracking across all HPC facilities, ensuring critical spares are stocked, tracked, and replenished to meet availability targets.
  • Maintain accurate asset records for all serialized and consumable inventory.

Planning, Vendor & Budget Management
  • Conduct capacity planning for space, power, cooling, and cabling to stay ahead of growth.
  • Gather requirements and plan new hardware installations including physical placement, power/cooling needs, and cabling.
  • Manage relationships with colocation providers and hardware vendors; negotiate contracts and SLAs.
  • Develop and manage operational budgets for equipment, staffing, and facilities.

Networking & Linux
  • Possess strong working knowledge of networking concepts including L2/L3 protocols, VLANs, BGP, OSPF, LACP, ECMP, and high‑performance fabrics relevant to HPC environments.
  • Understand network architectures such as spine‑leaf, fat‑tree, and high‑radix topologies used in HPC clusters.
  • Maintain strong Linux systems knowledge: be comfortable navigating and troubleshooting at the OS level, including storage, networking, process management, log analysis, and system diagnostics.

AI‑Driven Operations
  • Use AI tools daily across all aspects of the role: writing and reviewing documentation, analyzing operational data, drafting procedures, managing communications, and problem‑solving.
  • Champion AI adoption within the team and set the expectation that every team member integrates AI into their daily workflows.
  • Identify and implement opportunities where AI can replace or augment manual operational processes.

Cross‑Team Partnership
  • Partner with HPC Engineering, Network Engineering, and other teams to align operations with research and business needs.
  • Ensure compliance with all safety, security, and regulatory requirements.

Travel
  • Travel regularly to Jump's HPC data center sites for operational oversight, project execution, and team engagement. This is a core requirement of the role.

Additional duties as assigned or needed.

Skills You'll Need:
  • At least 7 years of data center operations experience, with at least 3 years leading teams in 24/7 critical infrastructure environments.
  • HPC environment experience strongly preferred.
  • In‑depth knowledge of data center power systems, power distribution/striping, and failover/redundancy architectures.
  • In‑depth knowledge of cooling technologies including air cooling, liquid cooling (direct‑to‑chip, rear‑door heat exchangers, CDUs), and environmental control systems.
  • Proven experience building and maintaining preventative maintenance programs and operational standards/procedures.
  • Strong experience with data center monitoring platforms (DCIM, BMS, environmental sensors) and defining monitoring/alerting strategies.
  • High energy, results‑driven, and able to work under pressure with tight deadlines.

Technical Skills:
  • Deep knowledge of server hardware architectures: multi‑socket platforms, GPU/accelerator systems, memory subsystems, NVMe storage, BMC/IPMI, and firmware management.
  • Deep knowledge of network switch hardware: line cards, optics/transceivers, switch fabrics, and platform diagnostics across Arista and Cisco platforms.
  • Proven hardware break‑fix experience with the ability to diagnose failures at the component level (CPUs, DIMMs, GPUs, NICs, PSUs, drives, line cards, optics).
  • Strong understanding of networking concepts: L2/L3 protocols, VLANs, BGP, OSPF, LACP, ECMP, and HPC network topologies (spine‑leaf, fat‑tree).
  • Strong Linux systems proficiency, well beyond basic CLI usage: comfortable with OS‑level troubleshooting, storage and network configuration, process management, log analysis, and system diagnostics.
  • Experience managing inventory and spares programs for critical infrastructure.
  • Structured cabling standards expertise.
  • Programming/scripting experience (Python preferred) is a plus.
  • Demonstrated heavy use of AI tools (e.g., LLM‑based assistants, AI coding tools, AI‑driven analytics) in a professional setting. You should already be using AI daily and be eager to push its application further across operations.
  • Strong project management skills with multi‑site infrastructure deployment experience.
  • Knowledge of industry standards including ASHRAE and TIA‑942.
  • Excellent written and verbal communication skills with the ability to communicate effectively across technical and non‑technical audiences.
  • Meet physical requirements including working on ladders/elevated platforms and lifting up to 50 lbs.
  • Extremely high personal standards for work quality and operational discipline.
  • Reliable and predictable availability, including ability to work evenings and weekends as required.
  • Willingness and ability to travel regularly to data center sites.
  • Bachelor's degree preferred.

Vacancy posted 3 days ago