AI Infrastructure SRE (GPU Cloud / Kubernetes)

Virtual Tech Gurus

Responsibilities
  • Maintain reliability of GPU clusters and AI workloads
  • Monitor systems (Prometheus, Grafana)
  • Automate provisioning and recovery workflows
  • Troubleshoot performance bottlenecks

Requirements
  • Strong Linux + scripting (Python/Bash)
  • Experience with Kubernetes (production environments)
  • Observability tools experience

Preferred
  • GPU workloads / HPC clusters
  • Slurm or distributed training systems

Vacancy posted 2 days ago
Similar jobs that could be interesting for you
Based on the AI Infrastructure SRE (GPU Cloud / Kubernetes) vacancy in Dallas, TX
  •  ...hiring for a technical role in Dallas, Texas, focused on maintaining GPU clusters and AI workloads. Candidates should possess strong Linux and scripting skills, as well as experience with Kubernetes in production settings. Responsibilities include monitoring systems with... 
    Cloud

    Virtual Tech Gurus

    Dallas, TX
    2 days ago
  • NorthMark Compute and Cloud LLC is seeking an HPC Kubernetes Solutions Architect to provide customer guidance in designing and integrating GPU-accelerated Kubernetes platforms tailored for HPC. This role requires deep technical expertise in Kubernetes and strong engagement... 
    Cloud

    NorthMark Compute and Cloud LLC

    Dallas, TX
    4 days ago
  • $400 per month

     ...candidates to support a project with a leading AI research lab. This role involves completing...  ...using frontier AI coding agents. With a focus on infrastructure processes, candidates should have 2+ years in DevOps, SRE, or Cloud Engineering, and experience with key cloud... 
    Cloud

    Mercor

    Mesquite, TX
    4 days ago
  • NorthMark Compute & Cloud (NMC²) is backed...  ...computing (HPC) and cloud infrastructure that supports its...  ...Manager, HPC Kubernetes Platform to lead...  ...orchestration layer powering GPU- and CPU-intensive...  ...engineering, AI systems, and high-...  ...-Code, CI/CD, and SRE best practices.... 
    Cloud

    NMC2

    Dallas, TX
    18 hours ago
  •  ...NMC²’s broader compute, cloud, and digital...  ...unified, analytics- and AI-ready data environment....  ...ecosystem across physical infrastructure and higher-level platform...  ...infrastructure, HPC clusters, GPU workloads, job...  ...schedulers (e.g., SLURM, Kubernetes), telemetry, and operational... 
    Cloud

    NMC2

    Dallas, TX
    1 day ago
  •  ...Engineering Manager, AI Compute Platform (...  ...delivering GPU-as-a-Service (GPUaaS...  ..., GPU-accelerated infrastructure in a flexible, multi...  ...a bare-metal Kubernetes platform optimized...  ...infrastructure, HPC, and cloud-like service...  ...Automation, SRE & Platform Operations... 
    Cloud
    Relocation
    Flexible hours

    GTN Technical Staffing

    Dallas, TX
    a month ago
  •  ...Chase within Enterprise technology AI/ML Data Platforms team, you will be...  ...such as Databricks, Snowflake, AWS, Kubernetes, etc. Coordinate incident...  ...Qualifications ~10+ years in an SRE or production support role with AWS Cloud, Databricks, Snowflake or similar Technologies... 
    Cloud

    Chase

    Dallas, TX
    1 day ago
  •  ...expertise in Data Science, Machine Learning, and AI. Our business value and leadership have...  ...Manager with a strong background in infrastructure to join our dynamic team. As a Product...  ...of infrastructure technologies, such as cloud computing, networking, virtualization, and... 
    Cloud

    Tiger Analytics

    Dallas, TX
    18 hours ago
  • $113k - $173k

     ...IT Infrastructure Engineer Addison, TX (Hybrid); Bellevue, WA (Hybrid...  ...with best practices Leverage AI and automation tools...  ..., Site Reliability Engineer, Cloud Engineer or similar, with relevant...  ...other container technologies (Kubernetes, Amazon ECS/EKS) Demonstrated... 
    Cloud
    Full time
    Live in
    Work at office
    Worldwide
    Flexible hours
    3 days per week

    Tanium

    Addison, TX
    18 hours ago
  •  ...architectures that enable private cloud and hybrid cloud...  ...experience supporting AI‑driven workloads, GPU‑based platforms, and cloud‑like infrastructure, and a proven track...  ...into broader SRE and platform monitoring...  ...Experience supporting Kubernetes, OpenShift, or cloud‑native... 
    Cloud

    Motion Recruitment Partners LLC

    Dallas, TX
    9 days ago
  •  ...Techniques is seeking a skilled professional in Dallas, Texas, to design and optimize GPU-accelerated container platforms. The ideal candidate will have expertise in NVIDIA and Kubernetes ecosystems, with a focus on high-performance workloads. This role includes... 
    Work at office
    Relocation package
    3 days per week

    Career Techniques Inc

    Dallas, TX
    1 day ago
  •  ...Job Title: AI Infrastructure Engineer Location: Remote, USA Job Description This...  ...HPC) environments. Experience with cloud platforms and on-premises infrastructure...  ...technologies such as Docker and Kubernetes. Familiarity with AI frameworks and... 
    Cloud
    Remote work

    United IT Solutions

    Dallas, TX
    2 days ago
  • $148k - $249k

     ...Description Waabi, founded by AI visionary Raquel...  ...and performance of cloud and on-prem environments...  ...tooling (Go/Python/Java, Kubernetes/Docker) for CI/CD-based...  ...monitors, and scales its infrastructure. - Drive execution...  ...streaming/batch/ML platforms; GPU/xPU or Arm performance... 
    Cloud
    Full time
    Work at office
    Work from home
    Flexible hours

    Waabi

    Dallas, TX
    13 days ago
  •  ...Lead complex technology Cloud initiatives including...  ...provisioning of Cloud Infrastructure using Infrastructure as...  ...offerings such as Compute and AI & ML on GCP and/or...  ...Reliability Engineer (SRE) principles • Proficient...  ...2-3 large scale Kubernetes based infrastructure build... 
    Cloud
    Work experience placement

    TriOptus LLC

    Irving, TX
    3 days ago
  • Category Manager - HPC Infrastructure...  ...-performance computing (HPC) and AI data center infrastructure, including GPU/accelerator platforms, compute...  ...supporting hyperscale, AI, cloud, or HPC infrastructure deployments... 
    Cloud
    Contract work

    NorthMark Strategies LLC

    Dallas, TX
    3 days ago
  •  ...Speechify is seeking a Data-focused Software Engineer to enhance our AI model training operations. This role involves sourcing audio data, extending cloud infrastructure on GCP, and collaborating closely with scientists. Ideal candidates have a BS/MS/PhD in Computer Science... 
    Cloud
    Remote work

    Clutch Canada

    Dallas, TX
    18 hours ago
  •  ...Site Reliability Engineer (SRE) with a strong background in Google Cloud Platform (GCP), and...  ...of critical services and infrastructure. Google Cloud Expertise...  ...(Compute Engine, Kubernetes Engine, Cloud Storage, BigQuery...  ...with Google BI and AI/ML tools (Looker,... 
    Cloud

    Sparktek

    Farmers Branch, TX
    3 days ago
  •  ...Position Title: Sr. AI Developer/Architect Location: 600 E. Las...  ...MS Stack Microsoft Azure AI & Cloud Services Azure Azure Data Factory...  ...Machine Learning (Azure ML) Kubernetes / AKS Infrastructure as Code (Terraform / Bicep) LLM Claude Code GPU Workloads Cloud Security &... 
    Cloud
    Full time
    Remote work
    Monday to Friday

    ReqRoute Inc

    Irving, TX
    1 day ago
  • Role: Senior SRE Engineer Location: Washington DC - Hybrid Job Description...  .... You will bridge the gap between infrastructure and applications, leveraging Davis AI and Grail to drive proactive...  ...visibility in a mission-critical, multi-cloud landscape. Core Responsibilities... 
    Cloud
    Work from home
    Flexible hours

    Vytwo

    Dallas, TX
    1 day ago
  •  ...design, and build our SRE foundation from the ground...  ...application and infrastructure monitoring solutions...  ...environments Google Cloud Infrastructure Excellence...  ...methodologies Leverage AI and machine learning for...  ...containerization (Docker, Kubernetes) and orchestration... 
    Cloud
    Remote work

    INFINITE CHOICE LLC

    Dallas, TX
    more than 2 months ago
  • $130k - $200k

     ...Minds is an enterprise AI fine-tuning platform that...  ...on‑prem or on your cloud provider. Our patented...  ...Minds, you will own the infrastructure that makes our AI platform...  ...reliability to GPU performance optimization...  ...and deploy models using Kubernetes and Docker to ensure scalable... 
    Cloud
    Work at office
    Flexible hours

    4MindsAI Inc.

    Dallas, TX
    2 days ago
  • Job Description Cloud SRE Engineer - Associate Who We Look For Goldman Sachs Engineers are...  .... Predictive Observability: Implement AI-driven observability stacks (e.g., Datadog...  ...using Amazon ECS Service Connect. Infrastructure as Code (IaC): Develop and maintain modular... 
    Cloud

    Goldman Sachs

    Dallas, TX
    1 day ago
  •  ...Data & Analytics Office (CDAO) AI/ML & Data Platforms team,...  ...solutions. Through code and cloud infrastructure, you will configure, maintain...  ..., Snowflake, AWS, and Kubernetes Collaborates with other software...  ...work environment to support SRE workflows with strong... 
    Cloud
    Work at office

    Chase

    Dallas, TX
    18 hours ago
  •  ...Title: Cloud Infrastructure Engineer Location: Irving, TX / Concord, CA (Hybrid) Key skills: Strong implementation & migration exp, Azure...  ...Azure ecosystem. Terraform, knowledge cloud policy, AI foundry. The Terraform piece is the most important. someone... 
    Cloud

    RIT Solutions, Inc.

    Irving, TX
    3 days ago
  •  ...NorthMark Compute & Cloud (NMC²) is backed by dedicated...  ...(HPC) and cloud infrastructure that supports its clients...  ...form the backbone of HPC, AI/ML, and data-intensive...  ...Deep understanding of GPU communication frameworks...  ...Cilium, NVIDIA CNI) for HPC/Kubernetes environments. Exposure... 
    Cloud

    NMC2

    Dallas, TX
    1 day ago
  •  ...Building the world’s leading AI‑powered, cloud‑native products that shape...  ...operates the foundational infrastructure layer that powers Confluent...  .... Our platform is built on Kubernetes and runs across a large fleet...  ...effectively with product, SRE/operations, security, and... 
    Cloud

    IBM Computing

    Dallas, TX
    3 days ago
  • $110k - $160k

    We're Topaz Labs, an AI tech company that builds one-click image and video quality software...  ...AI and Gigapixel, ML model training infrastructure and distribution, and our website and...  ...on experience with AWS, Azure, or similar cloud platforms. Experience building and deploying... 
    Cloud
    Full time
    Work experience placement
    Relocation

    Topaz Labs

    Dallas, TX
    3 days ago
  • $140.2k - $185.8k

     .... About WEX and the AI Platform Team: At WEX,...  ...quickly and safely. We're cloud-first, automation-...  ...GitOps practices, and infrastructure automation across AWS and...  ...engineering or DevOps/SRE experience, with at least...  ...orchestration (Docker, Kubernetes, ECS, AKS, or similar).... 
    Cloud
    Remote work
    Flexible hours

    WEX

    Dallas, TX
    1 day ago
  •  ...looking for a hands‑on SRE leader to build and develop...  ...Director of SRE & Cloud Operations, you'll transform...  ...with increasing use of AI and automation to get...  ..., DevOps, or infrastructure engineering, with at least...  ...work. Experience with Kubernetes container orchestration... 
    Cloud

    Paradigm

    Irving, TX
    1 day ago
  • Goldman Sachs is seeking a motivated Cloud Site Reliability Engineer (SRE) in Dallas, Texas. The candidate will be responsible for ensuring the resilience...  ...Key responsibilities include defining SLOs, implementing AI-driven observability, and managing microservices... 
    Cloud
    Full time

    Goldman Sachs

    Dallas, TX
    1 day ago
