AI Infra SRE: GPU Cloud & Kubernetes Reliability

Virtual Tech Gurus

Virtual Tech Gurus is hiring for a technical role in Dallas, Texas, focused on maintaining GPU clusters and AI workloads. Candidates should possess strong Linux and scripting skills, as well as experience with Kubernetes in production settings. Responsibilities include monitoring systems with tools like Prometheus and Grafana, automating workflows, and troubleshooting performance issues. Familiarity with GPU workloads and distributed training systems is preferred.

Vacancy posted 2 days ago

Similar jobs that could be interesting for you, based on the AI Infra SRE: GPU Cloud & Kubernetes Reliability vacancy in Dallas, TX

AI Infrastructure SRE (GPU Cloud / Kubernetes)
Responsibilities Maintain reliability of GPU clusters and AI workloads Monitor systems (Prometheus, Grafana) Automate provisioning and recovery workflows... ...Strong Linux + scripting (Python/Bash) Experience with Kubernetes (production environments) Observability tools experience...
Cloud
Virtual Tech Gurus
Dallas, TX
2 days ago
HPC Kubernetes Architect for GPU & AI/ML Platforms
NorthMark Compute and Cloud LLC is seeking an HPC Kubernetes Solutions Architect to provide customer guidance in designing and integrating GPU-accelerated Kubernetes platforms tailored for HPC. This role requires deep technical expertise in Kubernetes and strong engagement...
Cloud
NorthMark Compute and Cloud LLC
Dallas, TX
4 days ago
Engineering Manager, HPC Kubernetes Platform
NorthMark Compute & Cloud (NMC²) is backed by... ...Manager, HPC Kubernetes Platform to lead the... ...orchestration layer powering GPU- and CPU-intensive... ...performance, reliability, and automation. You... ...infrastructure engineering, AI systems, and high-... ...-Code, CI/CD, and SRE best practices....
Cloud
NMC2
Dallas, TX
17 hours ago
Site Reliability Engineer
...Job Position:- Site Reliability Engineer Duration:... ...Site Reliability Engineer (SRE) with a strong background in Google Cloud Platform (GCP), and... ...services (Compute Engine, Kubernetes Engine, Cloud Storage, BigQuery... ...with Google BI and AI/ML tools (Looker,...
Cloud
Sparktek
Farmers Branch, TX
3 days ago
Senior SRE (Site Reliability Engineer)
Role: Senior SRE Engineer Location: Washington DC - Hybrid Job Description... ...and applications, leveraging Davis AI and Grail to drive proactive reliability, mentoring cross-functional DevOps teams... ...in a mission-critical, multi-cloud landscape. Core Responsibilities Enterprise...
Cloud
Work from home
Flexible hours
Vytwo
Dallas, TX
1 day ago
Principal Site Reliability Engineer (SRE)
...exceptional Principal Site Reliability Engineer to architect, design, and build our SRE foundation from the... ...environments Google Cloud Infrastructure Excellence... ...Leverage AI and machine learning for... ...containerization (Docker, Kubernetes) and orchestration platforms...
Cloud
Remote work
INFINITE CHOICE LLC
Dallas, TX
more than 2 months ago
Site Reliability Engineer III
...Site Reliability Engineer III There's nothing more exciting... ...Office (CDAO) AI/ML & Data Platforms team... ...solutions. Through code and cloud infrastructure, you... ...Databricks, Snowflake, AWS, and Kubernetes Collaborates with... ...environment to support SRE workflows with strong...
Cloud
Work at office
Chase
Dallas, TX
5 days ago
Senior Lead Software Engineer- SRE
...Lead Site Reliability Engineer As a Lead Site Reliability... ...within Enterprise technology AI/ML Data Platforms team,... ..., Snowflake, AWS, Kubernetes, etc. Coordinate incident... ...~10+ years in an SRE or production support role with AWS Cloud, Databricks, Snowflake or...
Cloud
Chase
Dallas, TX
1 day ago
Manager, Site Reliability Engineering
...looking for a Manager, Site Reliability Engineering to be part... ...looking for a hands‑on SRE leader to build and... ...Senior Director of SRE & Cloud Operations, you'll... ...with increasing use of AI and automation to get there... ...work. Experience with Kubernetes container orchestration...
Cloud
Paradigm
Irving, TX
1 day ago
AI-Driven DevOps/SRE Engineer - Cloud & Kubernetes
$400 per month
...seeking experienced candidates to support a project with a leading AI research lab. This role involves completing and evaluating... ...infrastructure processes, candidates should have 2+ years in DevOps, SRE, or Cloud Engineering, and experience with key cloud tools. The...
Cloud
Mercor
Mesquite, TX
4 days ago
Engineering Manager, HPC Kubernetes Platform
...Engineering Manager, AI Compute Platform (... ...delivering GPU-as-a-Service (GPUaaS... ...building a bare-metal Kubernetes platform optimized... ...infrastructure, HPC, and cloud-like service... ...culture of ownership, reliability, and continuous improvement... ...) Automation, SRE & Platform...
Cloud
Relocation
Flexible hours
GTN Technical Staffing
Dallas, TX
a month ago
System Reliability Engineer (SRE) 1
System Reliability Engineer (SRE) 1: 3 to 5 years' experience. Location: Kansas City, Mi or Atlanta... ...architecture, infrastructure, and cloud technologies. Proficiency in scripting... ...technologies (e.g., Docker, Kubernetes). Familiarity with infrastructure as...
Cloud
Highbrow LLC
Dallas, TX
3 days ago
Cloud Engineering Manager
...are seeking a Manager, Cloud Engineering to lead the... ..., and enabling secure, reliable, and cost‑effective infrastructure... ...application, platform, SRE, security, and network... .... Experience with Kubernetes and container platforms... ...patterns. Exposure to GPU, AI/ML, or high‑performance...
Cloud
Local area
Texas Instruments
Dallas, TX
1 day ago
Network Director
...that enable private cloud and hybrid cloud platforms... ...supporting AI‑driven workloads, GPU‑based platforms, and... ...observability into broader SRE and platform monitoring... ...insights to improve reliability, capacity planning, and... ...supporting Kubernetes, OpenShift, or cloud‑...
Cloud
Motion Recruitment Partners LLC
Dallas, TX
9 days ago
DevOps / Site Reliability Engineer
...application development and AI/ML, and our people-first culture... ...We are looking for a Site Reliability Engineer to ensure the reliability... ...operation of a multi-cloud application security platform... ...automation, with a focus on Kubernetes, Terraform, CI/CD, and CSPM technologies...
Cloud
Work at office
Remote work
Visa sponsorship
Work visa
Flexible hours
AgileEngine
Irving, TX
4 days ago
Cloud SRE Associate - AWS, SLOs & Predictive Observability
Goldman Sachs is seeking a motivated Cloud Site Reliability Engineer (SRE) in Dallas, Texas. The candidate will be responsible for ensuring the resilience... ...Key responsibilities include defining SLOs, implementing AI-driven observability, and managing microservices...
Cloud
Full time
Goldman Sachs
Dallas, TX
1 day ago
GPU-Optimized Kubernetes Engineer for AI/ML
...Techniques is seeking a skilled professional in Dallas, Texas, to design and optimize GPU-accelerated container platforms. The ideal candidate will have expertise in NVIDIA and Kubernetes ecosystems, with a focus on high-performance workloads. This role includes...
Work at office
Relocation package
3 days per week
Career Techniques Inc
Dallas, TX
1 day ago
DevOps Practice Lead
...enabling enterprise-scale cloud and DevOps... ...Automation, DevSecOps, AI/ML Operations (... ...Deliver DevSecOps and Infra metrics. Team... ...DevOps — Terraform — Kubernetes — Docker —Ansible... ...quality, platform reliability, and operational efficiency... ...management, and SRE best practices. Drive...
Cloud
H1b
MSH
Dallas, TX
4 days ago
DevOps Engineer
...• Lead complex technology Cloud initiatives including those... ...offerings such as Compute and AI & ML on GCP and/or Azure... ..., DevOps concepts, and Site Reliability Engineer (SRE) principles • Proficient... ...have handled 2-3 large scale Kubernetes based infrastructure build out...
Cloud
Work experience placement
TriOptus LLC
Irving, TX
3 days ago
Senior Site Reliability Engineer
$129.1k - $189.34k
...Financial Issuance as a Service (IFIaaS) Cloud Service includes a wide array of... ...in an on-prem environment. The Sr. Site Reliability Engineer (SRE) will be responsible for ensuring that... ...providing an integrated platform of scalable, AI-enabled security offerings. We enable...
Cloud
Work at office
Local area
Remote work
Relocation
Flexible hours
3 days per week
Entrust
Dallas, TX
3 days ago
Senior DevOps Engineer - AWS, Kubernetes, Cloud Infra - Intelligence - Remote
$151.04k - $234.11k
Senior DevOps Engineer - AWS, Kubernetes, Cloud Infra - Intelligence - Remote. Locations: Remote - United States. Time type: Full time. Posted... ...cloud platforms, automating infrastructure, and driving reliability across mission-critical systems? We're looking for a...
Cloud
Remote job
TBK Bank, SSB
Dallas, TX
4 days ago
Infra Ops Tech Lead
...Infrastructure & Operations Engineer – AI/GPU Cloud (US, Remote) We're hiring on behalf... ...networking Leading incident response, driving SRE practices (SLOs/SLIs, observability,... ...SRE/DevOps chops: Terraform, Ansible, Kubernetes, Prometheus, Grafana, Python/Bash ~...
Cloud
Permanent employment
Remote work
Day shift
Trust In SODA
Dallas, TX
17 hours ago
Site Reliability Engineer
$100.6k - $199k
...generation of workloads to our Public Cloud platform. We work together... ...help you develop skills in AI infrastructure, Cloud services... ...into Microsoft. As a SRE II in Azure Specialized, you will... ...will improve the availability, reliability, efficiency, observability, and...
Cloud
Ongoing contract
Local area
Microsoft Corporation
Irving, TX
2 days ago
Artificial Intelligence Engineer
...Position Title: Sr. AI Developer/Architect Location... ...Microsoft Azure AI & Cloud Services Azure Azure Data... ...Learning (Azure ML) Kubernetes / AKS Infrastructure as... ...Bicep) LLM Claude Code GPU Workloads Cloud Security... ...detection, and operational reliability. Collaboration &...
Cloud
Full time
Remote work
Monday to Friday
ReqRoute Inc
Irving, TX
1 day ago
Machine Learning Operations Engineer
$130k - $200k
...Mission 4Minds is an enterprise AI fine-tuning platform that... ...deployed on‑prem or on your cloud provider. Our patented... ...stack, from inference pipeline reliability to GPU performance optimization across... ...and deploy models using Kubernetes and Docker to ensure scalable...
Cloud
Work at office
Flexible hours
4MindsAI Inc.
Dallas, TX
2 days ago
Asset & Wealth Management-Cloud SRE Engineer-Associate-Dallas
Job Description Cloud SRE Engineer - Associate Who We Look For Goldman Sachs Engineers are... .... We are seeking a motivated Cloud Site Reliability Engineer (SRE) to support the WM Data... ...stability. Predictive Observability: Implement AI-driven observability stacks (e.g.,...
Cloud
Goldman Sachs
Dallas, TX
1 day ago
SRE I: Reliability, Cloud & Automation Engineer
A leading technology company is looking for a System Reliability Engineer (SRE) 1 to ensure the reliability, scalability, and performance of their... ...the SRE role, strong knowledge of system architecture and cloud technologies, and proficiency in scripting languages like...
Cloud
Highbrow LLC
Dallas, TX
3 days ago
Senior Manager, Site Reliability Engineering
$103.5k - $172.5k
Overview: Senior Manager, Site Reliability Engineering. The Site Reliability... ...operational aspects, the SRE Sr. Manager actively contributes... ...telemetry, observability, and AI-driven monitoring solutions.... ...eCommerce platforms with one of the Cloud providers (AWS/Azure/Google...
Cloud
Contract work
Temporary work
Shift work
JCPenney
Dallas, TX
1 day ago
Senior SRE Manager: Automation, Reliability & Leadership
$103.5k - $172.5k
JCPenney is seeking a Senior Manager for Site Reliability Engineering in Dallas, Texas. In this hybrid leadership role, you'll oversee the SRE teams, driving productivity and ensuring... ...should have extensive experience in SRE, cloud technologies, and a strong educational...
Cloud
JCPenney
Dallas, TX
1 day ago
Senior DevOps Engineer — AWS, Kubernetes, Cloud Infra Remote
TBK Bank, SSB is seeking a Senior DevOps Engineer in Dallas, TX, or remotely to shape the cloud infrastructure strategy, focusing on AWS and Kubernetes to drive platform reliability for a fast-growing FinTech organization. The ideal candidate will have at least 5 years...
Cloud
Remote job
TBK Bank, SSB
Dallas, TX
3 days ago
