AI Infrastructure SRE (GPU Cloud / Kubernetes)
Virtual Tech Gurus
Responsibilities
- Maintain reliability of GPU clusters and AI workloads
- Monitor systems (Prometheus, Grafana)
- Automate provisioning and recovery workflows
- Troubleshoot performance bottlenecks

Requirements
- Strong Linux + scripting (Python/Bash)
- Experience with Kubernetes (production environments)
- Observability tools experience

Preferred
- GPU workloads / HPC clusters
- Slurm or distributed training systems
Vacancy posted 2 days ago
Location: Dallas, TX

