Senior AI Infra SRE: Remote GPU Clusters & Performance

Cortes 23

Remote job

Cortes 23 in San Francisco is seeking a Senior Site Reliability Engineer to design and operate large-scale GPU infrastructure. This high-impact role requires deep expertise in distributed systems and a proactive approach to incident management. The successful candidate will ensure reliability and performance, serving as a key technical liaison for customers managing large-scale AI workloads. The position offers the opportunity to shape foundational AI infrastructure within a dynamic team.

Vacancy posted 3 days ago

Similar jobs that could be interesting for you, based on the Senior AI Infra SRE: Remote GPU Clusters & Performance vacancy in San Francisco, CA

Senior AI Infra SRE: GPU Clusters & High-Perf Networking
A leading AI infrastructure company is looking for a Senior Site Reliability Engineer to design and operate large-scale GPU clusters. In this role, you will work closely with clients to troubleshoot... ...with GPU systems, high-performance networking, and Linux internals....
Senior
Performance
Andromeda
San Francisco, CA
4 days ago
Senior SRE — AI GPU Infra for Large-Scale HPC (IPO Equity)
$250k
...engineer to design and maintain large-scale GPU clusters for training and inference. The candidate should have over 7 years in SRE or DevOps, with strong skills in... ...Experience with observability stacks and high-performance computing is preferred. The role offers an...
Senior
Performance
Hamilton Barnes Associates Limited
San Francisco, CA
2 days ago
Senior AI Storage Engineer - Remote GPU HPC Infra
...Associates Limited is looking for a Senior Storage Engineer to support large-scale AI infrastructure in San Francisco.... ...storage solutions for high-performance GPU platforms. The ideal candidate has... ..., including stock options and remote working options....
Remote job
Senior
Performance
Hamilton Barnes Associates Limited
San Francisco, CA
1 day ago
Senior ML Platform Engineer Scale GPU AI Infra (Remote, Equity)
$152k - $241.5k
NVIDIA Corporation is seeking a Senior ML Platform Engineer to design and scale high-performance ML infrastructure. You'll utilize IaC techniques with Ansible and Terraform... ...role demands 5+ years in platform engineering or SRE, a solid understanding of ML workflows, and...
Remote job
Senior
Performance
NVIDIA
Santa Clara, CA
4 days ago
Senior AI Infra Engineer-Distributed GPU Clusters (Equity)
...Gruppe in Santa Clara is seeking a Senior Software Engineer to lead the... ...training across large-scale GPU platforms. Candidates should have substantial experience in AI applications and technical... ...actionable insights to drive performance improvements. You will also mentor...
Senior
Performance
NVIDIA Gruppe
Santa Clara, CA
2 days ago
Principal AI/ML Infra Engineer GPU Clusters & HPC
$272k - $431.25k
NVIDIA Corporation seeks a Principal AI and ML Infra Software Engineer in Santa Clara, California... ...the efficiency of AI/ML research on GPU Clusters. The role involves collaboration with... ...teams, monitoring infrastructure performance, and implementing improvements based on...
Performance
NVIDIA
Santa Clara, CA
4 days ago
Principal AI and ML Infra Software Engineer, GPU Clusters
$272k - $431.25k
We are seeking a Principal AI and ML Infra Software Engineer, GPU Clusters at NVIDIA to join our Hardware Infrastructure team. As an Engineer, you will... ...roadmaps for such initiatives. Monitor and optimize the performance of our infrastructure ensuring high availability,...
Performance
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Senior Site Reliability Engineer (GPU Clusters) - Hosting
$250k
...a rapidly scaling AI cloud infrastructure... ...a next-generation GPU platform designed for... ...is looking for a Senior / Staff Site Reliability... ..., scalability, and performance of HPC and cloud... ...for GPU compute clusters Collaborate with... ...options Bonus Remote working option and...
Remote work
Senior
Performance
Permanent employment
San Francisco, CA
27 days ago
Senior AI Datacenter TPM for At-Scale GPU Clusters (Remote)
$168k - $258.75k
A leading AI technology company in Santa Clara is seeking a Senior Datacenter Technical Program Manager. In this role, you will drive the integration of cutting... ...candidate has 8+ years of experience in high-performance computing, excellent teamwork skills, and a background...
Remote job
Senior
Performance
NVIDIA Corporation
Santa Clara, CA
1 day ago
Senior AI GPU Infra SRE - Scale, Automation & Equity
$300k
...Francisco seeks a Platform Engineer/Senior Site Reliability Engineer to manage their AI and cloud platform. You will design and maintain large-scale GPU clusters, create automation pipelines, and... ...have over 7 years of experience in SRE or DevOps, strong skills in Kubernetes...
Senior
Hamilton Barnes Associates Limited
San Francisco, CA
5 hours ago
Senior Staff, AI Supercomputing & GPU Clusters Lead
...a Member of Technical Staff in AI Supercomputing to design, build, and operate a GPU supercomputing environment. You... ...scale research by ensuring high-performance computing efficiency. The ideal... ...strong background in operating GPU clusters, container orchestration, and deep...
Senior
Performance
Radical Numerics Inc.
San Francisco, CA
2 days ago
Senior AI/HPC GPU Cluster Architect (Equity)
NVIDIA Gruppe in Santa Clara is seeking a technical leader for the GPU AI/HPC Infrastructure team. You will design and implement cutting-edge GPU compute clusters, focusing on deep learning and high-performance computing. The ideal candidate will have at least 5 years of...
Senior
Performance
NVIDIA Gruppe
Santa Clara, CA
4 days ago
Sr. Site Reliability Engineer (SRE)
$165k - $225k
...Sr. Site Reliability Engineer (SRE) Chicago, IL or Remote Moonlite delivers high-performance AI infrastructure for organizations... ...and building production-grade clusters from the ground up (not just deploying... ...SR-IOV for high-performance GPU interconnects, multi-tenancy...
Remote work
Senior
Performance
Flexible hours
Moonlite AI
United States
3 days ago
Senior AI GPU Cluster Performance Engineer
Advanced Micro Devices in Austin, TX, is seeking a GPU Cluster Network Performance Attainment Engineer. This role focuses on optimizing GPU cluster performance with a strong emphasis on RDMA networks. The ideal candidate will have extensive experience in GPU architectures...
Senior
Performance
Advanced Micro Devices
Austin, TX
3 days ago
Senior System Software Engineer - GPU Performance
$152k - $241.5k
...in Artificial Intelligence, High Performance Computing and Visualization. The GPU, our invention, serves as the visual... ...on large multi-GPU and multi-node clusters. Study the interaction of our... ...existing vacancy. NVIDIA uses AI tools in its recruiting processes....
Remote work
Senior
Performance
NVIDIA
Santa Clara, CA
2 days ago
Senior HPC & GPU Infra Engineer — Build Frontier AI
Sciforium is looking for a Senior HPC & GPU Infrastructure Engineer to oversee our GPU compute cluster’s health, reliability, and performance. This role involves hands-on Linux systems engineering, GPU driver management, and maintaining machine learning software stacks...
Senior
Performance
Flexible hours
Sciforium
San Francisco, CA
1 day ago
Senior Datacenter Technical Program Manager, At-Scale AI Clusters
$168k - $258.75k
...Santa Clara: US, CA, Remote: US, Remote... ...and deploy large scale GPU computing systems based... ...Experience with high-performance computing systems and...
Remote work
Senior
Performance
For contractors
NVIDIA Corporation
Santa Clara, CA
1 day ago
Senior GPU Compute Infra Engineer (Remote US)
$200k - $400k
Inferact is seeking a dedicated cluster administration engineer to manage high-performance GPU compute infrastructure in San Francisco. This hands-on role focuses... ...clusters and strong Linux admin skills. Exceptional remote candidates may also be considered. The...
Remote job
Senior
Performance
Inferact
San Francisco, CA
5 hours ago
Senior DGX Cloud Production Engineer — AI Infra & SRE
NVIDIA Gruppe is seeking an experienced Senior Production Engineer to manage GPU clusters for AI workloads and enhance reliability and scalability. This role... ...implementing monitoring capabilities, and ensuring optimal performance through incident management. Competitive salary...
Senior
Performance
NVIDIA Gruppe
California, MO
1 day ago
AI Systems Engineer — Remote, GPU Cluster & LLM Deployments
Krämer IT Solutions GmbH is looking for an AI Engineer / DevOps for our Saar-Cloud in Germany. You will build the engine room for tomorrow's AI and optimize our GPU clusters for the best possible performance. You have experience with Docker and Kubernetes, and your tasks...
Remote job
Performance
Flexible hours
Server Eye
New Bremen, OH
15 hours ago
Senior Machine Learning Engineer, DevOps/SRE
...team The Advertising Performance group focuses on... ...Reinforcement Learning, AI, Control and... ...talented and experienced Senior Software Engineer,... ...in DevOps/SRE practices, cloud infrastructure... ...and GCP, including GPU/TPU-based training... ...are flexible for remote work except for...
Remote work
Senior
Performance
Work at office
Local area
Monday to Thursday
Flexible hours
Roku, Inc.
Austin, TX
2 days ago
Senior AI/ML Infra & SRE Engineer
Senior Infrastructure Engineer - Bland As a Senior Infrastructure... ...industries. Lead - AI/ML Stack Infrastructure... ...production Kubernetes clusters optimized for AI/ML workloads with GPU support, implementing container... ...and monitoring for model performance and drift....
Senior
Performance
Temporary work
AI Chopping Block, Inc.
San Francisco, CA
1 day ago
Senior HPC Systems Engineer - GPU & AI Clusters
$146k - $194k
...is powered by Lattice OS, an AI-powered operating system that... ...ROLE Anduril is seeking a High Performance Computing (HPC) System Engineer... ...Architect and deploy advanced GPU infrastructure, leading the design... ...-user login environments, and cluster management software (e.g.,...
Senior
Performance
Full time
Work experience placement
Immediate start
Anduril Industries
Costa Mesa, CA
4 days ago
Principal AI/ML Infra Engineer for GPU Clusters
...NVIDIA Gruppe is seeking a Principal AI and ML Infra Software Engineer to join our Hardware Infrastructure team in Santa Clara, CA. In... ...enhance efficiency by addressing infrastructure deficiencies for GPU Clusters, fostering innovations in AI/ML research. The ideal...
Jobleads-US
Santa Clara, CA
4 days ago
AI Infra SRE: GPU Cloud & Kubernetes Reliability
...a technical role in Dallas, Texas, focused on maintaining GPU clusters and AI workloads. Candidates should possess strong Linux and scripting... ...and Grafana, automating workflows, and troubleshooting performance issues. Familiarity with GPU workloads and distributed training...
Performance
Virtual Tech Gurus
Dallas, TX
2 days ago
Remote AI GPU Senior Staff Software Engineer (Kernel)
...leading tech firm is seeking a talented Senior Staff Software Engineer to design and... ...Data Center Compute racks. This remote role requires expertise in GPU programming and LINUX driver development, with a focus on performance and efficiency. Candidates should have...
Remote job
Senior
Performance
Confidential Company
Richardson, TX
4 days ago
Senior SRE - Healthcare AI Platform (Remote)
A healthcare technology company based in San Francisco is seeking an experienced Site Reliability Engineer (SRE) to ensure the reliability and performance of their systems. Candidates should have over 5 years of professional engineering experience, strong cloud environment...
Remote job
Senior
Performance
Flexible hours
Plenful
San Francisco, CA
15 hours ago
Senior Site Reliability Engineer
$170k - $290k
...Senior Site Reliability Engineer Luma's mission is to build multimodal AI to expand human imagination and capabilities... ..., reliable, and performant GPU infrastructure that... ...boundaries of scale. Our SRE team is the... ...just maintain existing clusters; you will help define...
Remote work
Senior
Performance
Work experience placement
Luma AI
United States
4 days ago
AI Infra & Cluster Engineer — Scale GPU/CPU Orchestration
...is seeking an Infrastructure/Cluster Engineer to design and operate... ...large-scale clusters that enable AI inference at scale. The role... ...Responsibilities include debugging performance issues and designing... ...cluster health. Experience with GPU infrastructure is a plus....
Performance
Linuxcareers
San Francisco, CA
2 days ago
Senior SRE — AI Cloud Infra, Scale & Resilience
Jobgether is seeking a Senior Site Reliability Engineer (SRE) based in Germany, focusing on maintaining and improving cloud infrastructure reliability, scalability, and performance. You will enhance critical services in a fast-paced environment, ensuring smooth operations...
Senior
Performance
Jobgether
New Bremen, OH
4 days ago
