AI Infra SRE: GPU Cloud & Kubernetes Reliability
Virtual Tech Gurus
Virtual Tech Gurus is hiring for a technical role in Dallas, Texas, focused on maintaining GPU clusters and AI workloads. Candidates should possess strong Linux and scripting skills, as well as experience with Kubernetes in production settings. Responsibilities include monitoring systems with tools like Prometheus and Grafana, automating workflows, and troubleshooting performance issues. Familiarity with GPU workloads and distributed training systems is preferred.
- Responsibilities: Maintain reliability of GPU clusters and AI workloads. Monitor systems (Prometheus, Grafana). Automate provisioning and recovery workflows... ...Strong Linux + scripting (Python/Bash). Experience with Kubernetes (production environments). Observability tools experience... (Cloud)
- NorthMark Compute and Cloud LLC is seeking an HPC Kubernetes Solutions Architect to provide customer guidance in designing and integrating GPU-accelerated Kubernetes platforms tailored for HPC. This role requires deep technical expertise in Kubernetes and strong engagement... (Cloud)
- NorthMark Compute & Cloud (NMC²) is backed by... ...Manager, HPC Kubernetes Platform to lead the... ...orchestration layer powering GPU- and CPU-intensive... ...performance, reliability, and automation. You... ...infrastructure engineering, AI systems, and high-... ...-Code, CI/CD, and SRE best practices.... (Cloud)
- ...Job Position: Site Reliability Engineer. Duration:... ...Site Reliability Engineer (SRE) with a strong background in Google Cloud Platform (GCP), and... ...services (Compute Engine, Kubernetes Engine, Cloud Storage, BigQuery... ...with Google BI and AI/ML tools (Looker,... (Cloud)
- Role: Senior SRE Engineer. Location: Washington DC - Hybrid. Job Description... ...and applications, leveraging Davis AI and Grail to drive proactive reliability, mentoring cross-functional DevOps teams... ...in a mission-critical, multi-cloud landscape. Core Responsibilities: Enterprise... (Cloud, Work from home, Flexible hours)
- ...exceptional Principal Site Reliability Engineer to architect, design, and build our SRE foundation from the... ...environments Google Cloud Infrastructure Excellence... ...Leverage AI and machine learning for... ...containerization (Docker, Kubernetes) and orchestration platforms... (Cloud, Remote work)
- ...Site Reliability Engineer III. There's nothing more exciting... ...Office (CDAO) AI/ML & Data Platforms team... ...solutions. Through code and cloud infrastructure, you... ...Databricks, Snowflake, AWS, and Kubernetes. Collaborates with... ...environment to support SRE workflows with strong... (Cloud, Work at office)
- ...Lead Site Reliability Engineer. As a Lead Site Reliability... ...within Enterprise technology AI/ML Data Platforms team,... ..., Snowflake, AWS, Kubernetes, etc. Coordinate incident... ...~10+ years in an SRE or production support role with AWS Cloud, Databricks, Snowflake or... (Cloud)
- ...looking for a Manager, Site Reliability Engineering to be part... ...looking for a hands‑on SRE leader to build and... ...Senior Director of SRE & Cloud Operations, you'll... ...with increasing use of AI and automation to get there... ...work. Experience with Kubernetes container orchestration... (Cloud)
$400 per month
...seeking experienced candidates to support a project with a leading AI research lab. This role involves completing and evaluating... ...infrastructure processes, candidates should have 2+ years in DevOps, SRE, or Cloud Engineering, and experience with key cloud tools. The... (Cloud)
- ...Engineering Manager, AI Compute Platform (... ...delivering GPU-as-a-Service (GPUaaS... ...building a bare-metal Kubernetes platform optimized... ...infrastructure, HPC, and cloud-like service... ...culture of ownership, reliability, and continuous improvement... ...) Automation, SRE & Platform... (Cloud, Relocation, Flexible hours)
- System Reliability Engineer (SRE) 1, 3 to 5 years experience. Location: Kansas City, Mi or Atlanta... ...architecture, infrastructure, and cloud technologies. Proficiency in scripting... ...technologies (e.g., Docker, Kubernetes). Familiarity with infrastructure as... (Cloud)
- ...are seeking a Manager, Cloud Engineering to lead the... ..., and enabling secure, reliable, and cost‑effective infrastructure... ...application, platform, SRE, security, and network... .... Experience with Kubernetes and container platforms... ...patterns. Exposure to GPU, AI/ML, or high‑performance... (Cloud, Local area)
- ...that enable private cloud and hybrid cloud platforms... ...supporting AI‑driven workloads, GPU‑based platforms, and... ...observability into broader SRE and platform monitoring... ...insights to improve reliability, capacity planning, and... ...supporting Kubernetes, OpenShift, or cloud‑... (Cloud)
- ...application development and AI/ML, and our people-first culture... ...We are looking for a Site Reliability Engineer to ensure the reliability... ...operation of a multi-cloud application security platform... ...automation, with a focus on Kubernetes, Terraform, CI/CD, and CSPM technologies... (Cloud, Work at office, Remote work, Visa sponsorship, Work visa, Flexible hours)
- Goldman Sachs is seeking a motivated Cloud Site Reliability Engineer (SRE) in Dallas, Texas. The candidate will be responsible for ensuring the resilience... ...Key responsibilities include defining SLOs, implementing AI-driven observability, and managing microservices... (Cloud, Full time)
- ...Techniques is seeking a skilled professional in Dallas, Texas, to design and optimize GPU-accelerated container platforms. The ideal candidate will have expertise in NVIDIA and Kubernetes ecosystems, with a focus on high-performance workloads. This role includes... (Work at office, Relocation package, 3 days per week)
- ...enabling enterprise-scale cloud and DevOps... ...Automation, DevSecOps, AI/ML Operations (... ...Deliver DevSecOps and Infra metrics. Team... ...DevOps, Terraform, Kubernetes, Docker, Ansible... ...quality, platform reliability, and operational efficiency... ...management, and SRE best practices. Drive... (Cloud, H1b)
- ...• Lead complex technology Cloud initiatives including those... ...offerings such as Compute and AI & ML on GCP and/or Azure... ..., DevOps concepts, and Site Reliability Engineer (SRE) principles • Proficient... ...have handled 2-3 large scale Kubernetes based infrastructure build out... (Cloud, Work experience placement)
$129.1k - $189.34k
...Financial Issuance as a Service (IFIaaS) Cloud Service includes a wide array of... ...in an on-prem environment. The Sr. Site Reliability Engineer (SRE) will be responsible for ensuring that... ...providing an integrated platform of scalable, AI-enabled security offerings. We enable... (Cloud, Work at office, Local area, Remote work, Relocation, Flexible hours, 3 days per week)
$151.04k - $234.11k
Senior DevOps Engineer - AWS, Kubernetes, Cloud Infra - Intelligence - Remote. Locations: Remote - United States. Time type: Full time. Posted... ...cloud platforms, automating infrastructure, and driving reliability across mission-critical systems? We're looking for a... (Cloud, Remote job)
- ...Infrastructure & Operations Engineer – AI/GPU Cloud (US, Remote). We're hiring on behalf... ...networking. Leading incident response, driving SRE practices (SLOs/SLIs, observability,... ...SRE/DevOps chops: Terraform, Ansible, Kubernetes, Prometheus, Grafana, Python/Bash ~... (Cloud, Permanent employment, Remote work, Day shift)
$100.6k - $199k
...generation of workloads to our Public Cloud platform. We work together... ...help you develop skills in AI infrastructure, Cloud services... ...into Microsoft. As a SRE II in Azure Specialized, you will... ...will improve the availability, reliability, efficiency, observability, and... (Cloud, Ongoing contract, Local area)
- ...Position Title: Sr. AI Developer/Architect. Location... ...Microsoft Azure AI & Cloud Services Azure Azure Data... ...Learning (Azure ML) Kubernetes / AKS Infrastructure as... ...Bicep) LLM Claude Code GPU Workloads Cloud Security... ...detection, and operational reliability. Collaboration &... (Cloud, Full time, Remote work, Monday to Friday)
$130k - $200k
...Mission: 4Minds is an enterprise AI fine-tuning platform that... ...deployed on‑prem or on your cloud provider. Our patented... ...stack, from inference pipeline reliability to GPU performance optimization across... ...and deploy models using Kubernetes and Docker to ensure scalable... (Cloud, Work at office, Flexible hours)
- Job Description: Cloud SRE Engineer - Associate. Who We Look For: Goldman Sachs Engineers are... .... We are seeking a motivated Cloud Site Reliability Engineer (SRE) to support the WM Data... ...stability. Predictive Observability: Implement AI-driven observability stacks (e.g.,... (Cloud)
- A leading technology company is looking for a System Reliability Engineer (SRE) 1 to ensure the reliability, scalability, and performance of their... ...the SRE role, strong knowledge of system architecture and cloud technologies, and proficiency in scripting languages like... (Cloud)
$103.5k - $172.5k
Overview: Senior Manager, Site Reliability Engineering. The Site Reliability... ...operational aspects, the SRE Sr. Manager actively contributes... ...telemetry, observability, and AI-driven monitoring solutions.... ...eCommerce platforms with one of the Cloud providers (AWS/Azure/Google... (Cloud, Contract work, Temporary work, Shift work)
$103.5k - $172.5k
JCPenney is seeking a Senior Manager for Site Reliability Engineering in Dallas, Texas. In this hybrid leadership role, you'll oversee the SRE teams, driving productivity and ensuring... ...should have extensive experience in SRE, cloud technologies, and a strong educational... (Cloud)
- TBK Bank, SSB is seeking a Senior DevOps Engineer in Dallas, TX, or remotely to shape the cloud infrastructure strategy, focusing on AWS and Kubernetes to drive platform reliability for a fast-growing FinTech organization. The ideal candidate will have at least 5 years... (Cloud, Remote job)

