Senior AI Infra SRE: Remote GPU Clusters & Performance
Cortes 23
- Remote job
Cortes 23 in San Francisco is seeking a Senior Site Reliability Engineer to design and operate large-scale GPU infrastructure. This high-impact role requires deep expertise in distributed systems and a proactive approach to incident management. The successful candidate will ensure reliability and performance, serving as a key technical liaison for customers managing large-scale AI workloads. The position offers the opportunity to shape foundational AI infrastructure within a dynamic team.
- A leading AI infrastructure company is looking for a Senior Site Reliability Engineer to design and operate large-scale GPU clusters. In this role, you will work closely with clients to troubleshoot... ...with GPU systems, high-performance networking, and Linux internals.... [Senior, Performance]
$250k
...engineer to design and maintain large-scale GPU clusters for training and inference. The candidate should have over 7 years in SRE or DevOps, with strong skills in... ...Experience with observability stacks and high-performance computing is preferred. The role offers an... [Senior, Performance]
- ...Associates Limited is looking for a Senior Storage Engineer to support large-scale AI infrastructure in San Francisco.... ...storage solutions for high-performance GPU platforms. The ideal candidate has... ..., including stock options and remote working options.... [Remote job, Senior, Performance]
$152k - $241.5k
NVIDIA Corporation is seeking a Senior ML Platform Engineer to design and scale high-performance ML infrastructure. You'll utilize IaC techniques with Ansible and Terraform... ...role demands 5+ years in platform engineering or SRE, a solid understanding of ML workflows, and... [Remote job, Senior, Performance]
- ...Gruppe in Santa Clara is seeking a Senior Software Engineer to lead the... ...training across large-scale GPU platforms. Candidates should have substantial experience in AI applications and technical... ...actionable insights to drive performance improvements. You will also mentor... [Senior, Performance]
$272k - $431.25k
NVIDIA Corporation seeks a Principal AI and ML Infra Software Engineer in Santa Clara, California... ...the efficiency of AI/ML research on GPU Clusters. The role involves collaboration with... ...teams, monitoring infrastructure performance, and implementing improvements based on... [Performance]
$272k - $431.25k
We are seeking a Principal AI and ML Infra Software Engineer, GPU Clusters at NVIDIA to join our Hardware Infrastructure team. As an Engineer, you will... ...roadmaps for such initiatives. Monitor and optimize the performance of our infrastructure ensuring high availability,... [Performance]
$250k
...a rapidly scaling AI cloud infrastructure... ...a next-generation GPU platform designed for... ...is looking for a Senior / Staff Site Reliability... ..., scalability, and performance of HPC and cloud... ...for GPU compute clusters Collaborate with... ...options Bonus Remote working option and... [Remote work, Senior, Performance, Permanent employment]
$168k - $258.75k
A leading AI technology company in Santa Clara is seeking a Senior Datacenter Technical Program Manager. In this role, you will drive the integration of cutting... ...candidate has 8+ years of experience in high-performance computing, excellent teamwork skills, and a background... [Remote job, Senior, Performance]
$300k
...Francisco seeks a Platform Engineer/Senior Site Reliability Engineer to manage their AI and cloud platform. You will design and maintain large-scale GPU clusters, create automation pipelines, and... ...have over 7 years of experience in SRE or DevOps, strong skills in Kubernetes... [Senior]
- ...a Member of Technical Staff in AI Supercomputing to design, build, and operate a GPU supercomputing environment. You... ...scale research by ensuring high-performance computing efficiency. The ideal... ...strong background in operating GPU clusters, container orchestration, and deep... [Senior, Performance]
- NVIDIA Gruppe in Santa Clara is seeking a technical leader for the GPU AI/HPC Infrastructure team. You will design and implement cutting-edge GPU compute clusters, focusing on deep learning and high-performance computing. The ideal candidate will have at least 5 years of... [Senior, Performance]
$165k - $225k
...Sr. Site Reliability Engineer (SRE). Chicago, IL or Remote. Moonlite delivers high-performance AI infrastructure for organizations... ...and building production-grade clusters from the ground up (not just deploying... ...SR-IOV for high-performance GPU interconnects, multi-tenancy... [Remote work, Senior, Performance, Flexible hours]
- Advanced Micro Devices in Austin, TX, is seeking a GPU Cluster Network Performance Attainment Engineer. This role focuses on optimizing GPU cluster performance with a strong emphasis on RDMA networks. The ideal candidate will have extensive experience in GPU architectures... [Senior, Performance]
$152k - $241.5k
...in Artificial Intelligence, High Performance Computing and Visualization. The GPU, our invention, serves as the visual... ...on large multi-GPU and multi-node clusters. Study the interaction of our... ...existing vacancy. NVIDIA uses AI tools in its recruiting processes.... [Remote work, Senior, Performance]
- Sciforium is looking for a Senior HPC & GPU Infrastructure Engineer to oversee our GPU compute cluster’s health, reliability, and performance. This role involves hands-on Linux systems engineering, GPU driver management, and maintaining machine learning software stacks... [Senior, Performance, Flexible hours]
$168k - $258.75k
Senior Datacenter Technical Program Manager, At-Scale AI Clusters... ...Santa Clara: US, CA, Remote: US, Remote; time type:... ...and deploy large scale GPU computing systems based... ...Experience with high-performance computing systems and... [Remote work, Senior, Performance, For contractors]
$200k - $400k
Inferact is seeking a dedicated cluster administration engineer to manage high-performance GPU compute infrastructure in San Francisco. This hands-on role focuses... ...clusters and strong Linux admin skills. Exceptional remote candidates may also be considered. The... [Remote job, Senior, Performance]
- NVIDIA Gruppe is seeking an experienced Senior Production Engineer to manage GPU clusters for AI workloads and enhance reliability and scalability. This role... ...implementing monitoring capabilities, and ensuring optimal performance through incident management. Competitive salary... [Senior, Performance]
- Krämer IT Solutions GmbH is looking for an AI Engineer / DevOps for our Saar-Cloud in Germany. You will build the engine room for the AI of tomorrow and optimize our GPU clusters for the best possible performance. You have experience with Docker and Kubernetes, and your tasks... [Remote job, Performance, Flexible hours]
- ...team The Advertising Performance group focuses on... ...Reinforcement Learning, AI, Control and... ...talented and experienced Senior Software Engineer,... ...in DevOps/SRE practices, cloud infrastructure... ...and GCP, including GPU/TPU-based training... ...are flexible for remote work except for... [Remote work, Senior, Performance, Work at office, Local area, Monday to Thursday, Flexible hours]
- Senior Infrastructure Engineer - Bland. As a Senior Infrastructure... ...industries. Lead - AI/ML Stack Infrastructure... ...production Kubernetes clusters optimized for AI/ML workloads with GPU support, implementing container... ...and monitoring for model performance and drift.... [Senior, Performance, Temporary work]
$146k - $194k
...is powered by Lattice OS, an AI-powered operating system that... ...ROLE Anduril is seeking a High Performance Computing (HPC) System Engineer... ...Architect and deploy advanced GPU infrastructure, leading the design... ...-user login environments, and cluster management software (e.g.,... [Senior, Performance, Full time, Work experience placement, Immediate start]
- ...NVIDIA Gruppe is seeking a Principal AI and ML Infra Software Engineer to join our Hardware Infrastructure team in Santa Clara, CA. In... ...enhance efficiency by addressing infrastructure deficiencies for GPU Clusters, fostering innovations in AI/ML research. The ideal...
- ...a technical role in Dallas, Texas, focused on maintaining GPU clusters and AI workloads. Candidates should possess strong Linux and scripting... ...and Grafana, automating workflows, and troubleshooting performance issues. Familiarity with GPU workloads and distributed training... [Performance]
- ...leading tech firm is seeking a talented Senior Staff Software Engineer to design and... ...Data Center Compute racks. This remote role requires expertise in GPU programming and Linux driver development, with a focus on performance and efficiency. Candidates should have... [Remote job, Senior, Performance]
- A healthcare technology company based in San Francisco is seeking an experienced Site Reliability Engineer (SRE) to ensure the reliability and performance of their systems. Candidates should have over 5 years of professional engineering experience, strong cloud environment... [Remote job, Senior, Performance, Flexible hours]
$170k - $290k
...Senior Site Reliability Engineer. Luma's mission is to build multimodal AI to expand human imagination and capabilities... ..., reliable, and performant GPU infrastructure that... ...boundaries of scale. Our SRE team is the... ...just maintain existing clusters; you will help define... [Remote work, Senior, Performance, Work experience placement]
- ...is seeking an Infrastructure/Cluster Engineer to design and operate... ...large-scale clusters that enable AI inference at scale. The role... ...Responsibilities include debugging performance issues and designing... ...cluster health. Experience with GPU infrastructure is a plus. [Performance]
- Jobgether is seeking a Senior Site Reliability Engineer (SRE) based in Germany, focusing on maintaining and improving cloud infrastructure reliability, scalability, and performance. You will enhance critical services in a fast-paced environment, ensuring smooth operations... [Senior, Performance]

