AI Infra SRE: GPU Cloud & Kubernetes Reliability

Virtual Tech Gurus

Virtual Tech Gurus is hiring for a technical role in Dallas, Texas, focused on maintaining GPU clusters and AI workloads. Candidates should possess strong Linux and scripting skills, as well as experience with Kubernetes in production settings. Responsibilities include monitoring systems with tools like Prometheus and Grafana, automating workflows, and troubleshooting performance issues. Familiarity with GPU workloads and distributed training systems is preferred.

Vacancy posted 2 days ago
Similar jobs that could be interesting for you
Based on the AI Infra SRE: GPU Cloud & Kubernetes Reliability in Dallas, TX vacancy
  • Responsibilities Maintain reliability of GPU clusters and AI workloads Monitor systems (Prometheus, Grafana) Automate provisioning and recovery workflows...  ...Strong Linux + scripting (Python/Bash) Experience with Kubernetes (production environments) Observability tools experience... 
    Cloud

    Virtual Tech Gurus

    Dallas, TX
    2 days ago
  • NorthMark Compute and Cloud LLC is seeking an HPC Kubernetes Solutions Architect to provide customer guidance in designing and integrating GPU-accelerated Kubernetes platforms tailored for HPC. This role requires deep technical expertise in Kubernetes and strong engagement... 
    Cloud

    NorthMark Compute and Cloud LLC

    Dallas, TX
    4 days ago
  • NorthMark Compute & Cloud (NMC²) is backed by...  ...Manager, HPC Kubernetes Platform to lead the...  ...orchestration layer powering GPU- and CPU-intensive...  ...performance, reliability, and automation. You...  ...infrastructure engineering, AI systems, and high-...  ...-Code, CI/CD, and SRE best practices.... 
    Cloud

    NMC2

    Dallas, TX
    17 hours ago
  •  ...Job Position:- Site Reliability Engineer Duration:...  ...Site Reliability Engineer (SRE) with a strong background in Google Cloud Platform (GCP), and...  ...services (Compute Engine, Kubernetes Engine, Cloud Storage, BigQuery...  ...with Google BI and AI/ML tools (Looker,... 
    Cloud

    Sparktek

    Farmers Branch, TX
    3 days ago
  • Role: Senior SRE Engineer Location: Washington DC - Hybrid Job Description...  ...and applications, leveraging Davis AI and Grail to drive proactive reliability, mentoring cross-functional DevOps teams...  ...in a mission-critical, multi-cloud landscape. Core Responsibilities Enterprise... 
    Cloud
    Work from home
    Flexible hours

    Vytwo

    Dallas, TX
    1 day ago
  •  ...exceptional Principal Site Reliability Engineer to architect, design, and build our SRE foundation from the...  ...environments Google Cloud Infrastructure Excellence...  ...Leverage AI and machine learning for...  ...containerization (Docker, Kubernetes) and orchestration platforms... 
    Cloud
    Remote work

    INFINITE CHOICE LLC

    Dallas, TX
    more than 2 months ago
  •  ...Site Reliability Engineer III There's nothing more exciting...  ...Office (CDAO) AI/ML & Data Platforms team...  ...solutions. Through code and cloud infrastructure, you...  ...Databricks, Snowflake, AWS, and Kubernetes Collaborates with...  ...environment to support SRE workflows with strong... 
    Cloud
    Work at office

    Chase

    Dallas, TX
    5 days ago
  •  ...Lead Site Reliability Engineer As a Lead Site Reliability...  ...within Enterprise technology AI/ML Data Platforms team,...  ..., Snowflake, AWS, Kubernetes, etc. Coordinate incident...  ...~10+ years in an SRE or production support role with AWS Cloud, Databricks, Snowflake or... 
    Cloud

    Chase

    Dallas, TX
    1 day ago
  •  ...looking for a Manager, Site Reliability Engineering to be part...  ...looking for a hands‑on SRE leader to build and...  ...Senior Director of SRE & Cloud Operations, you'll...  ...with increasing use of AI and automation to get there...  ...work. Experience with Kubernetes container orchestration... 
    Cloud

    Paradigm

    Irving, TX
    1 day ago
  • $400 per month

     ...seeking experienced candidates to support a project with a leading AI research lab. This role involves completing and evaluating...  ...infrastructure processes, candidates should have 2+ years in DevOps, SRE, or Cloud Engineering, and experience with key cloud tools. The... 
    Cloud

    Mercor

    Mesquite, TX
    4 days ago
  •  ...Engineering Manager, AI Compute Platform (...  ...delivering GPU-as-a-Service (GPUaaS...  ...building a bare-metal Kubernetes platform optimized...  ...infrastructure, HPC, and cloud-like service...  ...culture of ownership, reliability, and continuous improvement...  ...) Automation, SRE & Platform... 
    Cloud
    Relocation
    Flexible hours

    GTN Technical Staffing

    Dallas, TX
    a month ago
  • System Reliability Engineer (SRE) 1 —> 3 to 5 years experience Location :- Kansas City, Mi or Atlanta...  ...architecture, infrastructure, and cloud technologies. Proficiency in scripting...  ...technologies (e.g., Docker, Kubernetes). Familiarity with infrastructure as... 
    Cloud

    Highbrow LLC

    Dallas, TX
    3 days ago
  •  ...are seeking a Manager, Cloud Engineering to lead the...  ..., and enabling secure, reliable, and cost‑effective infrastructure...  ...application, platform, SRE, security, and network...  .... Experience with Kubernetes and container platforms...  ...patterns. Exposure to GPU, AI/ML, or high‑performance... 
    Cloud
    Local area

    Texas Instruments

    Dallas, TX
    1 day ago
  •  ...that enable private cloud and hybrid cloud platforms...  ...supporting AI‑driven workloads, GPU‑based platforms, and...  ...observability into broader SRE and platform monitoring...  ...insights to improve reliability, capacity planning, and...  ...supporting Kubernetes, OpenShift, or cloud‑... 
    Cloud

    Motion Recruitment Partners LLC

    Dallas, TX
    9 days ago
  •  ...application development and AI/ML, and our people-first culture...  ...We are looking for a Site Reliability Engineer to ensure the reliability...  ...operation of a multi-cloud application security platform...  ...automation, with a focus on Kubernetes, Terraform, CI/CD, and CSPM technologies... 
    Cloud
    Work at office
    Remote work
    Visa sponsorship
    Work visa
    Flexible hours

    AgileEngine

    Irving, TX
    4 days ago
  • Goldman Sachs is seeking a motivated Cloud Site Reliability Engineer (SRE) in Dallas, Texas. The candidate will be responsible for ensuring the resilience...  ...Key responsibilities include defining SLOs, implementing AI-driven observability, and managing microservices... 
    Cloud
    Full time

    Goldman Sachs

    Dallas, TX
    1 day ago
  •  ...Techniques is seeking a skilled professional in Dallas, Texas, to design and optimize GPU-accelerated container platforms. The ideal candidate will have expertise in NVIDIA and Kubernetes ecosystems, with a focus on high-performance workloads. This role includes... 
    Work at office
    Relocation package
    3 days per week

    Career Techniques Inc

    Dallas, TX
    1 day ago
  •  ...enabling enterprise-scale cloud and DevOps...  ...Automation, DevSecOps, AI/ML Operations (...  ...Deliver DevSecOps and Infra metrics. Team...  ...DevOps — Terraform — Kubernetes — Docker —Ansible...  ...quality, platform reliability, and operational efficiency...  ...management, and SRE best practices. Drive... 
    Cloud
    H1b

    MSH

    Dallas, TX
    4 days ago
  •  ...• Lead complex technology Cloud initiatives including those...  ...offerings such as Compute and AI & ML on GCP and/or Azure...  ..., DevOps concepts, and Site Reliability Engineer (SRE) principles • Proficient...  ...have handled 2-3 large scale Kubernetes based infrastructure build out... 
    Cloud
    Work experience placement

    TriOptus LLC

    Irving, TX
    3 days ago
  • $129.1k - $189.34k

     ...Financial Issuance as a Service (IFIaaS) Cloud Service includes a wide array of...  ...in an on-prem environment. The Sr. Site Reliability Engineer (SRE) will be responsible for ensuring that...  ...providing an integrated platform of scalable, AI-enabled security offerings. We enable... 
    Cloud
    Work at office
    Local area
    Remote work
    Relocation
    Flexible hours
    3 days per week

    Entrust

    Dallas, TX
    3 days ago
  • $151.04k - $234.11k

    Senior DevOps Engineer - AWS, Kubernetes, Cloud Infra - Intelligence - Remote. Locations: Remote - United States. Time type: Full time. Posted...  ...cloud platforms, automating infrastructure, and driving reliability across mission-critical systems? We're looking for a... 
    Cloud
    Remote job

    TBK Bank, SSB

    Dallas, TX
    4 days ago
  •  ...Infrastructure & Operations Engineer – AI/GPU Cloud (US, Remote) We're hiring on behalf...  ...networking Leading incident response, driving SRE practices (SLOs/SLIs, observability,...  ...SRE/DevOps chops: Terraform, Ansible, Kubernetes, Prometheus, Grafana, Python/Bash ~... 
    Cloud
    Permanent employment
    Remote work
    Day shift

    Trust In SODA

    Dallas, TX
    17 hours ago
  • $100.6k - $199k

     ...generation of workloads to our Public Cloud platform. We work together...  ...help you develop skills in AI infrastructure, Cloud services...  ...into Microsoft. As a SRE II in Azure Specialized, you will...  ...will improve the availability, reliability, efficiency, observability, and... 
    Cloud
    Ongoing contract
    Local area

    Microsoft Corporation

    Irving, TX
    2 days ago
  •  ...Position Title: Sr. AI Developer/Architect Location...  ...Microsoft Azure AI & Cloud Services Azure Azure Data...  ...Learning (Azure ML) Kubernetes / AKS Infrastructure as...  ...Bicep) LLM Claude Code GPU Workloads Cloud Security...  ...detection, and operational reliability. Collaboration &... 
    Cloud
    Full time
    Remote work
    Monday to Friday

    ReqRoute Inc

    Irving, TX
    1 day ago
  • $130k - $200k

     ...Mission 4Minds is an enterprise AI fine-tuning platform that...  ...deployed on‑prem or on your cloud provider. Our patented...  ...stack, from inference pipeline reliability to GPU performance optimization across...  ...and deploy models using Kubernetes and Docker to ensure scalable... 
    Cloud
    Work at office
    Flexible hours

    4MindsAI Inc.

    Dallas, TX
    2 days ago
  • Job Description Cloud SRE Engineer - Associate Who We Look For Goldman Sachs Engineers are...  .... We are seeking a motivated Cloud Site Reliability Engineer (SRE) to support the WM Data...  ...stability. Predictive Observability: Implement AI-driven observability stacks (e.g.,... 
    Cloud

    Goldman Sachs

    Dallas, TX
    1 day ago
  • A leading technology company is looking for a System Reliability Engineer (SRE) 1 to ensure the reliability, scalability, and performance of their...  ...the SRE role, strong knowledge of system architecture and cloud technologies, and proficiency in scripting languages like... 
    Cloud

    Highbrow LLC

    Dallas, TX
    3 days ago
  • $103.5k - $172.5k

    Overview Senior Manager, Site Reliability Engineering The Site Reliability...  ...operational aspects, the SRE Sr. Manager actively contributes...  ...telemetry, observability, and AI-driven monitoring solutions....  ...eCommerce platforms with one of the Cloud providers (AWS/Azure/Google... 
    Cloud
    Contract work
    Temporary work
    Shift work

    JCPenney

    Dallas, TX
    1 day ago
  • $103.5k - $172.5k

    JCPenney is seeking a Senior Manager for Site Reliability Engineering in Dallas, Texas. In this hybrid leadership role, you'll oversee the SRE teams, driving productivity and ensuring...  ...should have extensive experience in SRE, cloud technologies, and a strong educational... 
    Cloud

    JCPenney

    Dallas, TX
    1 day ago
  • TBK Bank, SSB is seeking a Senior DevOps Engineer in Dallas, TX, or remotely to shape the cloud infrastructure strategy, focusing on AWS and Kubernetes to drive platform reliability for a fast-growing FinTech organization. The ideal candidate will have at least 5 years... 
    Cloud
    Remote job

    TBK Bank, SSB

    Dallas, TX
    3 days ago
