Senior Staff Data Center Operations Engineer, GPU Hardware Architecture
$179k - $218k
Crusoe
Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack — from electrons to tokens — to power the world's most ambitious AI workloads. When you join Crusoe, you join a team that is building the future, faster.
We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that — with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.
We're looking for problem-solving, opportunity-finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved — people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.
If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high-performing team that believes in each other, come build with us at Crusoe.
The Mission
Crusoe is building the world's most climate-aligned AI infrastructure. As we scale toward unprecedented power densities and liquid-cooled architectures, the gap between "Data Center Design" and "Silicon Reality" must be bridged.
We are seeking a Senior Staff Data Center Operations Engineer, GPU Hardware Architecture to be the definitive technical authority on GPU platforms within the Data Center Engineering and Operations organization. Your mission is twofold: act as the primary technical consultant to our Data Center Engineering team to ensure future facilities are built for next-gen silicon, and provide the Operations team with the specialized tooling, SOPs, and predictive strategies needed to maintain peak cluster health.
The Strategic Bridge
For DC Engineering: You are the internal consultant. You translate upcoming GPU power/thermal roadmaps (NVIDIA/AMD) into design requirements for our next-generation facilities.
For Site Operations: You are the "Technical Enabler." You develop the diagnostic tools and technical SOPs that enable field technicians to resolve complex GPU issues with surgical accuracy.
For Sourcing: You are the "Technical Strategist." You define the technical sparing requirements and site-level inventory needs based on hardware failure telemetry.
Key Responsibilities
Engineering Education & Design Support: Provide deep-dive technical guidance to the Data Center Engineering team on upcoming silicon (e.g., NVIDIA Blackwell/Rubin, AMD MI350/400). Ensure future facility designs for power, cooling, and rack-spacing are ready for 2000W+ per-chip densities.
Predictive Operations & Telemetry: Leverage AI/ML methodologies to analyze fleet-wide telemetry (power draws, thermal gradients, and error rates). You will lead the transition from reactive troubleshooting to predictive maintenance, identifying "pre-failure" patterns in HBM or NVLink components before they impact customer training runs.
Technical Sparing Architecture: Architect the site-level sparing strategy from a technical perspective. Use failure telemetry and MTBF data to define the "Critical Spares List" and stocking levels required at each site to meet cluster uptime targets, providing these requirements to Sourcing for execution.
Operational Tooling & SOPs: Build the "Operational Blueprint" for the field. Create precision SOPs for high-stakes GPU repairs (e.g., baseboard swaps, manifold maintenance) and develop diagnostic tooling that allows Site Ops to identify NVLink flapping, PCIe degradation, or thermal throttling.
Advanced Troubleshooting & RCA: Act as the Tier-3 escalation point for the most complex hardware failures in the production environment. Lead Root Cause Analysis (RCA) on systemic issues that span the boundary between hardware and facility environmental factors.
Silicon Roadmap Authority: Maintain a 24-month forward-looking view of NVIDIA and AMD architectures. Educate internal stakeholders on how the transitions to HBM4, faster interconnects, and liquid cooling will impact Crusoe's physical infrastructure.
Vendor & VAR Technical Lead: Support the technical relationship with OEMs and VARs. Audit their hardware builds, review their technical bulletins, and ensure their hardware roadmaps align with Crusoe's operational and engineering standards.
Technical Requirements
Silicon & Fabric Mastery: Expert-level knowledge of NVIDIA (Hopper/Blackwell/Rubin) and AMD (Instinct) architectures. Mastery of the physical and logical layers of NVLink, NVSwitch, and InfiniBand.
Infrastructure Bridge-Building: Ability to translate "Silicon Data Sheets" into "Mechanical Engineering Requirements." You can explain how a GPU's specific heat-load profile affects CDU sizing and secondary loop design.
Data-Driven Diagnostics: Proficient in Python, Go, or Bash to build telemetry and health-check tools (utilizing DCGM and ROCm). Experience using large datasets or basic ML frameworks to build "Smart Monitoring" that filters critical health signals from noise.
Operational Reliability Analysis: Experience using failure telemetry to inform site-level sparing requirements and field-service workflows.
Thermal Management: Deep understanding of the operational realities of Direct-to-Chip (D2C) cooling, including fluid dynamics, pressure-drop curves, and the lifecycle of dripless couplings.
Qualifications
10+ years in Hardware Engineering, Systems Architecture, or Data Center Infrastructure.
The "Consultant" Mindset: Proven track record of educating and influencing cross-functional teams (specifically Engineering and Operations).
GPU Authority: You have managed or architected GPU clusters at scale (thousands of nodes) at a hyperscaler, a GPU-specialized cloud, or a major silicon vendor.
Education: B.S. or M.S. in Electrical Engineering, Computer Engineering, or a related technical field.
Benefits:
Competitive compensation
Restricted Stock Units
Paid time off & paid holidays
Comprehensive health, dental & vision insurance
Employer contributions to HSA account
Paid parental leave
Paid life insurance, short-term and long-term disability
Professional development & tuition reimbursement
Mental health & wellness support
Commuter benefits (parking & transit)
Cell phone stipend
401(k) Retirement plan with company match up to 4% of salary
Volunteer time off
Compensation Range
Compensation will be paid in the range of $179,000 - $218,000 + Bonus. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant's knowledge, education, and abilities, as well as internal equity and alignment with market data.
Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.


