GPU Infrastructure Architect - GPU Cluster Stand-up, Configuration & Operations

United IT

GPU Infrastructure Architect

The GPU Infrastructure Architect is responsible for designing, provisioning, and operating AMD MI350–based GPU clusters on a cloud platform. The role ensures scalable, secure, and reproducible GPU infrastructure to support distributed training and high-performance workloads.

Key Responsibilities

  • Design end-to-end GPU cluster architecture covering compute, networking, storage, and control services.
  • Provision and operationalize up to 9 AMD MI350 GPU clusters based on confirmed cloud SKU availability.
  • Configure GPU compute nodes including base OS images, GPU drivers, runtime libraries, and distributed training dependencies.
  • Implement automation for node imaging, bootstrapping, lifecycle management, patching, and upgrades.
  • Standardize environments using reproducible builds and Infrastructure-as-Code (IaC).
  • Enable workload portability through containerized environments and documented deployment patterns.
  • Implement OS baseline hardening, restricted administrative access, and secure cluster access controls.
  • Establish monitoring, logging, and operational runbooks to ensure reliability and performance.
Vacancy posted 12 hours ago
Similar jobs that could be interesting for you
Based on the GPU Infrastructure Architect - GPU Cluster Stand-up, Configuration & Operations vacancy in San Jose, CA
  • $272k - $431.25k

     ...era in which our GPU acts as the brains...  ...! As a Principal Architect on our powerful...  ...vision for our modern infrastructure while working...  ...editing and bulk operations. Driving platform...  .... Ways to stand out from the crowd...  ...release engineering or configuration management... 
    Operations

    NVIDIA

    Santa Clara, CA
    2 days ago
  • $256k - $414k

     ..., scaling, and operations of high‑performance...  ...networking for GPU‑based cloud infrastructure. This role is...  ...of network architects focused on high...  ...design of intra‑cluster and inter‑cluster...  ...large‑scale configurations using SR‑IOV, Xen...  ...GDPR. Ways To Stand Out From The Crowd... 
    Operations
    Local area

    NVIDIA

    Santa Clara, CA
    3 days ago
  • $272k - $431.25k

     ...Principal Ai And Ml Infra Software Engineer, Gpu Clusters We are seeking a Principal AI and...  ...at NVIDIA to join our Hardware Infrastructure team. As an Engineer, you will have a...  .... ~ Hands-on experience in using or operating High Performance Computing (HPC) grade... 
    Operations

    NVIDIA

    Santa Clara, CA
    2 days ago
  •  ...Title: Infrastructure Architect (AI & Data Center) Location: San Jose, CA, onsite...  ...of our next-generation GPU Farm and AI Factory...  ...assets and their long-term operational health. You will bridge the...  ...Architect the Management Cluster control plane (NKP, Prism... 
    Operations

    Trilyon, Inc.

    San Jose, CA
    2 days ago
  • $152k - $241.5k

     ...era in which our GPU acts as the brains...  ...and experienced HPC Cluster Engineer to design, deploy, and operate GPU Compute...  ...with researchers and infrastructure teams to ensure our...  ...including cluster configuration management tools...  ...fields. Ways to stand out from the crowd... 
    Operations

    NVIDIA

    Santa Clara, CA
    4 days ago
  •  ...Infrastructure Solutions Architect Proficient at the formulation of architectures, including...  ..., and architectural configuration. Years of experience: 13+ years...  ...infrastructure design and operations including IP, FTP, load balancing, clustering, failover, monitoring,... 
    Operations

    Omega Solutions

    Santa Clara, CA
    9 days ago
  • $184k - $287.5k

     ...Network Solutions Architect Engineer to help...  ...data center GPU and networking deployments...  ..., network, and cluster infrastructure in customer data...  ...and debug configuration and performance...  ...summaries. Ways to stand out from the...  ...bringing up and operating large clusters... 
    Remote work

    NVIDIA

    Santa Clara, CA
    15 days ago
  •  ...THE TEAM AMD's Data Center GPU organization is transforming...  ...risks. Establish a repeatable operating cadence (forecasts,...  ...experience with rack‑scale GPU/AI infrastructure, hyperscale deployment models...  ...commodity constraints, and configuration management across partner builds... 
    Operations
    Contract work
    Remote work

    Advanced Micro Devices

    Santa Clara, CA
    2 days ago
  • What You’ll Do * Own and deliver GPU infrastructure programs spanning datacenter enablement, GPU server provisioning, deployment, configuration, and lifecycle operations. * Align priorities and execution across multiple teams (engineering, networking, datacenter... 
    Operations
    Full time
    Flexible hours

    Oracle

    Santa Clara, CA
    2 days ago
  • $224k - $431.25k

    NVIDIA Corporation in Santa Clara, CA, seeks a senior technical leader to recruit and manage engineering teams focusing on large-scale GPU and AI networking projects. Applicants should possess a BS/MS/PhD in relevant engineering fields, with 8+ years of relevant... 

    NVIDIA Corporation

    Santa Clara, CA
    12 hours ago
  •  ...Infrastructure Architect Location: Milpitas, CA The Company: FireEye is the intelligence-led...  ...extension of customer security operations, FireEye offers a single platform that...  ...library of standardized procedures and configurations Work directly with business users... 
    Operations
    Temporary work
    Work at office
    Flexible hours
    Night shift

    Netpace

    Milpitas, CA
    4 days ago
  •  ...Cloud is a strong plus), with a focus on GPU-enabled infrastructure. This role will lead architecture and...  .../OpenShift, with GitOps-based operations. What Success Looks Like (First 6-...  ...optimization. • Experience with multi-cluster/multi-region platform design. • Prior... 
    Operations

    IBM

    San Jose, CA
    2 days ago
  •  ...Network Architect Remote – USA / Onsite – Santa Clara...  ...interconnects for GPU-accelerated data centers and compute clusters. Outstanding problem-solving...  ...languages used in infrastructure automation. SME in...  ...Experience using an automated configuration management system (... 
    Remote work

    Omni Inclusive

    Santa Clara, CA
    4 days ago
  •  ...Network Cluster Architect - Data Center Infrastructure Work Locations (2) Submit Resume The Data Center Hardware Engineering team is responsible...  ...perspectives, including cluster network architectures, rack configurations, and scale-out requirements for dense compute and AI... 

    Apple

    Cupertino, CA
    3 days ago
  • $124k - $195.5k

     ...We are now looking for a GPU system and process scheduling architect! NVIDIA is seeking outstanding system architects to design...  ...experience in Computer Architecture, Operation systems, and/or system architecture Ways to stand out from the crowd: Knowledge of memory... 
    Operations
    Work experience placement

    NVIDIA

    Santa Clara, CA
    12 hours ago
  • $272k - $431.25k

     ...about developing cloud infrastructure, we want to hear from...  ...and creative solutions architect with experience in...  ...interconnect to join the NVIDIA GPU Cloud Infrastructure...  ...: Design and operate end-to-end network...  ...concepts Ways to stand out from the crowd:... 
    Operations
    Remote work

    NVIDIA

    Santa Clara, CA
    1 day ago
  • $168k - $258.75k

     ...era of computing, where our GPU acts as the brains of computers...  ...with partners, engineering, operations, and field teams to develop comprehensive...  ...with cloud partners Ways To Stand Out From The Crowd Strong...  ...or leading cloud computing infrastructure and technologies Exposure to... 
    Operations

    NVIDIA

    Santa Clara, CA
    2 days ago
  •  ...hardware company in Sunnyvale is seeking a Network Architect to design robust interconnect architectures for AI clusters. The role demands extensive experience with...  ...and internal teams to ensure efficient network operations. This role offers an opportunity to impact cutting... 
    Operations

    Cerebras

    Sunnyvale, CA
    1 day ago
  • $224k - $356.5k

     ...developers use highly-optimized mathematical operations on all hardware available in NVIDIA...  ...libraries and want to build software that will stand the test-of-time as it accelerates...  ...Experience with cross-compilation, setting up CPU/GPU/accelerator (cross-)compilation... 
    Operations
    Flexible hours
    Shift work

    NVIDIA Corporation

    Santa Clara, CA
    3 days ago
  • $320k

     ...development of DGX Cloud strategy for GPU fleet lifecycle, health,...  ...accelerated computing infrastructure that enables customers with the highest availability and operational standards. What You’ll Be Doing...  ...computing customers. Ways to Stand Out from the Crowd:... 
    Operations
    Worldwide

    NVIDIA Corporation

    Santa Clara, CA
    3 days ago
  • $263k - $341k

     ...Technical Staff-Network Architect From applied...  ...generation large-scale AI Infrastructure to include accelerated...  ...Graphics Processing Unit (GPU)-GPU, server-storage,...  ...network topology and configuration to maximize performance of AI clusters • Work closely with key... 

    Dell

    Santa Clara, CA
    3 days ago
  • A leading technology firm is seeking a GPU IP Engineering Program Manager for their Santa Clara, California office. The role involves...  ...high-level activities for GPU IP development, managing operations, and providing strategic support to senior leaders. Candidates... 
    Operations
    Work at office

    Intel Corporation

    Santa Clara, CA
    12 hours ago
  • $136k - $212.75k

    NVIDIA AI in Santa Clara, California is seeking a qualified individual to join the Operations Product Development Engineering GPU Team. In this role, you will be responsible for productizing NVIDIA’s chips across various markets, focusing on Silicon and Package qualification... 
    Operations

    NVIDIA AI

    Santa Clara, CA
    12 hours ago
  •  ...role in Santa Clara, CA. The ideal candidate will troubleshoot GPU performance issues, manage service tickets, and support documentation...  ...experience. This role entails hands-on support and administrative tasks to ensure smooth operations.
    Operations

    Jobs via Dice

    Santa Clara, CA
    4 days ago
  •  ...Devices is seeking a principal software developer to join the ROCm GPU-compute team in Santa Clara, California. The ideal candidate...  ...role involves developing advanced software for mathematical operations on GPUs, leading a small team, and optimizing performance. Join... 
    Operations

    Advanced Micro Devices

    Santa Clara, CA
    2 days ago
  • A leading technology company is seeking a seasoned professional in cloud computing to manage GPU capacity and drive HPC initiatives. This role involves collaborating with engineering teams, designing data models for governance, and improving resource efficiency through... 
    Operations

    NVIDIA

    Santa Clara, CA
    3 days ago
  • A leading technology company is seeking a Senior Manager for GPU Cloud Infrastructure to lead the design and operations of high-performance networking. In this role, you will build a specialized team, oversee network connectivity, and engage with ISPs for low-latency solutions... 
    Operations

    NVIDIA Corporation

    Santa Clara, CA
    4 days ago
  • $179k - $218k

     ...vertically integrated AI infrastructure company built from the...  ...ground up, we own and operate each layer of the...  ...Operations Engineer, GPU Hardware Architecture...  ...needed to maintain peak cluster health.   The Strategic...  ...Architecture: Architect the site-level sparing... 
    Operations
    Temporary work

    Crusoe

    Sunnyvale, CA
    9 days ago
  •  ...for effective coverage and density planning. Configure and deploy access points (Client), controllers, and related infrastructure components to ensure seamless connectivity across campus facilities. Operations & Maintenance: Perform regular system checks... 
    Operations
    For contractors
    Remote work

    Macpower Digital Assets Edge

    Santa Clara, CA
    2 days ago
  • $184k - $287.5k

     ...Senior Solutions Architect - AI Factory...  ...join our NVIDIA Infrastructure Specialists team...  ...on Linux-based GPU clusters, using NCCL and...  ...clusters. Ensure configurations align with guidelines...  .... Ways to Stand Out From the...  ...deployment, or operations. Background in... 
    Operations
    Remote work

    NVIDIA

    Santa Clara, CA
    3 days ago
