Senior AI Infra SRE: Remote GPU Clusters & Performance

Cortes 23

Cortes 23 in San Francisco is seeking a Senior Site Reliability Engineer to design and operate large-scale GPU infrastructure. This high-impact role requires deep expertise in distributed systems and a proactive approach to incident management. The successful candidate will ensure reliability and performance, serving as a key technical liaison for customers managing large-scale AI workloads. The position offers the opportunity to shape foundational AI infrastructure within a dynamic team.

Vacancy posted 3 days ago
Similar jobs that could be interesting for you
Based on the Senior AI Infra SRE: Remote GPU Clusters & Performance vacancy in San Francisco, CA
  • A leading AI infrastructure company is looking for a Senior Site Reliability Engineer to design and operate large-scale GPU clusters. In this role, you will work closely with clients to troubleshoot...  ...with GPU systems, high-performance networking, and Linux internals.... 
    Senior
    Performance

    Andromeda

    San Francisco, CA
    4 days ago
  • $250k

     ...engineer to design and maintain large-scale GPU clusters for training and inference. The candidate should have over 7 years in SRE or DevOps, with strong skills in...  ...Experience with observability stacks and high-performance computing is preferred. The role offers an... 
    Senior
    Performance

    Hamilton Barnes Associates Limited

    San Francisco, CA
    2 days ago
  •  ...Associates Limited is looking for a Senior Storage Engineer to support large-scale AI infrastructure in San Francisco....  ...storage solutions for high-performance GPU platforms. The ideal candidate has...  ..., including stock options and remote working options.... 
    Remote job
    Senior
    Performance

    Hamilton Barnes Associates Limited

    San Francisco, CA
    1 day ago
  • $152k - $241.5k

    NVIDIA Corporation is seeking a Senior ML Platform Engineer to design and scale high-performance ML infrastructure. You'll utilize IaC techniques with Ansible and Terraform...  ...role demands 5+ years in platform engineering or SRE, a solid understanding of ML workflows, and... 
    Remote job
    Senior
    Performance

    NVIDIA

    Santa Clara, CA
    4 days ago
  •  ...Gruppe in Santa Clara is seeking a Senior Software Engineer to lead the...  ...training across large-scale GPU platforms. Candidates should have substantial experience in AI applications and technical...  ...actionable insights to drive performance improvements. You will also mentor... 
    Senior
    Performance

    NVIDIA Gruppe

    Santa Clara, CA
    2 days ago
  • $272k - $431.25k

    NVIDIA Corporation seeks a Principal AI and ML Infra Software Engineer in Santa Clara, California...  ...the efficiency of AI/ML research on GPU Clusters. The role involves collaboration with...  ...teams, monitoring infrastructure performance, and implementing improvements based on... 
    Performance

    NVIDIA

    Santa Clara, CA
    4 days ago
  • $272k - $431.25k

    We are seeking a Principal AI and ML Infra Software Engineer, GPU Clusters at NVIDIA to join our Hardware Infrastructure team. As an Engineer, you will...  ...roadmaps for such initiatives. Monitor and optimize the performance of our infrastructure ensuring high availability,... 
    Performance

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  • $250k

     ...a rapidly scaling AI cloud infrastructure...  ...a next-generation GPU platform designed for...  ...is looking for a Senior / Staff Site Reliability...  ..., scalability, and performance of HPC and cloud...  ...for GPU compute clusters Collaborate with...  ...options Bonus  Remote working option and... 
    Remote work
    Senior
    Performance
    Permanent employment
    San Francisco, CA
    27 days ago
  • $168k - $258.75k

    A leading AI technology company in Santa Clara is seeking a Senior Datacenter Technical Program Manager. In this role, you will drive the integration of cutting...  ...candidate has 8+ years of experience in high-performance computing, excellent teamwork skills, and a background... 
    Remote job
    Senior
    Performance

    NVIDIA Corporation

    Santa Clara, CA
    1 day ago
  • $300k

     ...Francisco seeks a Platform Engineer/Senior Site Reliability Engineer to manage their AI and cloud platform. You will design and maintain large-scale GPU clusters, create automation pipelines, and...  ...have over 7 years of experience in SRE or DevOps, strong skills in Kubernetes... 
    Senior

    Hamilton Barnes Associates Limited

    San Francisco, CA
    5 hours ago
  •  ...a Member of Technical Staff in AI Supercomputing to design, build, and operate a GPU supercomputing environment. You...  ...scale research by ensuring high-performance computing efficiency. The ideal...  ...strong background in operating GPU clusters, container orchestration, and deep... 
    Senior
    Performance

    Radical Numerics Inc.

    San Francisco, CA
    2 days ago
  • NVIDIA Gruppe in Santa Clara is seeking a technical leader for the GPU AI/HPC Infrastructure team. You will design and implement cutting-edge GPU compute clusters, focusing on deep learning and high-performance computing. The ideal candidate will have at least 5 years of... 
    Senior
    Performance

    NVIDIA Gruppe

    Santa Clara, CA
    4 days ago
  • $165k - $225k

     ...Sr. Site Reliability Engineer (SRE), Chicago, IL or Remote. Moonlite delivers high-performance AI infrastructure for organizations...  ...and building production-grade clusters from the ground up (not just deploying...  ...SR-IOV for high-performance GPU interconnects, multi-tenancy... 
    Remote work
    Senior
    Performance
    Flexible hours

    Moonlite AI

    United States
    3 days ago
  • Advanced Micro Devices in Austin, TX, is seeking a GPU Cluster Network Performance Attainment Engineer. This role focuses on optimizing GPU cluster performance with a strong emphasis on RDMA networks. The ideal candidate will have extensive experience in GPU architectures... 
    Senior
    Performance

    Advanced Micro Devices

    Austin, TX
    3 days ago
  • $152k - $241.5k

     ...in Artificial Intelligence, High Performance Computing and Visualization. The GPU, our invention, serves as the visual...  ...on large multi-GPU and multi-node clusters. Study the interaction of our...  ...existing vacancy. NVIDIA uses AI tools in its recruiting processes.... 
    Remote work
    Senior
    Performance

    NVIDIA

    Santa Clara, CA
    2 days ago
  • Sciforium is looking for a Senior HPC & GPU Infrastructure Engineer to oversee our GPU compute cluster’s health, reliability, and performance. This role involves hands-on Linux systems engineering, GPU driver management, and maintaining machine learning software stacks... 
    Senior
    Performance
    Flexible hours

    Sciforium

    San Francisco, CA
    1 day ago
  • $168k - $258.75k

    Senior Datacenter Technical Program Manager, At-Scale AI Clusters...  ...Santa Clara: US, CA, Remote: US, Remote...  ...and deploy large scale GPU computing systems based...  ...Experience with high-performance computing systems and... 
    Remote work
    Senior
    Performance
    For contractors

    NVIDIA Corporation

    Santa Clara, CA
    1 day ago
  • $200k - $400k

    Inferact is seeking a dedicated cluster administration engineer to manage high-performance GPU compute infrastructure in San Francisco. This hands-on role focuses...  ...clusters and strong Linux admin skills. Exceptional remote candidates may also be considered. The... 
    Remote job
    Senior
    Performance

    Inferact

    San Francisco, CA
    5 hours ago
  • NVIDIA Gruppe is seeking an experienced Senior Production Engineer to manage GPU clusters for AI workloads and enhance reliability and scalability. This role...  ...implementing monitoring capabilities, and ensuring optimal performance through incident management. Competitive salary... 
    Senior
    Performance

    NVIDIA Gruppe

    California, MO
    1 day ago
  • Krämer IT Solutions GmbH is looking for an AI Engineer / DevOps for our Saar-Cloud in Germany. You will build the engine room for the AI of tomorrow and optimize our GPU clusters for the best possible performance. You have experience with Docker and Kubernetes, and your tasks... 
    Remote job
    Performance
    Flexible hours

    Server Eye

    New Bremen, OH
    15 hours ago
  •  ...team The Advertising Performance group focuses on...  ...Reinforcement Learning, AI, Control and...  ...talented and experienced Senior Software Engineer,...  ...in DevOps/SRE practices, cloud infrastructure...  ...and GCP, including GPU/TPU-based training...  ...are flexible for remote work except for... 
    Remote work
    Senior
    Performance
    Work at office
    Local area
    Monday to Thursday
    Flexible hours

    Roku, Inc.

    Austin, TX
    2 days ago
  • Senior Infrastructure Engineer - Bland As a Senior Infrastructure...  ...industries. Lead - AI/ML Stack Infrastructure...  ...production Kubernetes clusters optimized for AI/ML workloads with GPU support, implementing container...  ...and monitoring for model performance and drift.... 
    Senior
    Performance
    Temporary work

    AI Chopping Block, Inc.

    San Francisco, CA
    1 day ago
  • $146k - $194k

     ...is powered by Lattice OS, an AI-powered operating system that...  ...ROLE Anduril is seeking a High Performance Computing (HPC) System Engineer...  ...Architect and deploy advanced GPU infrastructure, leading the design...  ...-user login environments, and cluster management software (e.g.,... 
    Senior
    Performance
    Full time
    Work experience placement
    Immediate start

    Anduril Industries

    Costa Mesa, CA
    4 days ago
  •  ...NVIDIA Gruppe is seeking a Principal AI and ML Infra Software Engineer to join our Hardware Infrastructure team in Santa Clara, CA. In...  ...enhance efficiency by addressing infrastructure deficiencies for GPU Clusters, fostering innovations in AI/ML research. The ideal... 

    Jobleads-US

    Santa Clara, CA
    4 days ago
  •  ...a technical role in Dallas, Texas, focused on maintaining GPU clusters and AI workloads. Candidates should possess strong Linux and scripting...  ...and Grafana, automating workflows, and troubleshooting performance issues. Familiarity with GPU workloads and distributed training... 
    Performance

    Virtual Tech Gurus

    Dallas, TX
    2 days ago
  •  ...leading tech firm is seeking a talented Senior Staff Software Engineer to design and...  ...Data Center Compute racks. This remote role requires expertise in GPU programming and LINUX driver development, with a focus on performance and efficiency. Candidates should have... 
    Remote job
    Senior
    Performance

    Confidential Company

    Richardson, TX
    4 days ago
  • A healthcare technology company based in San Francisco is seeking an experienced Site Reliability Engineer (SRE) to ensure the reliability and performance of their systems. Candidates should have over 5 years of professional engineering experience, strong cloud environment... 
    Remote job
    Senior
    Performance
    Flexible hours

    Plenful

    San Francisco, CA
    15 hours ago
  • $170k - $290k

     ...Senior Site Reliability Engineer Luma's mission is to build multimodal AI to expand human imagination and capabilities...  ..., reliable, and performant GPU infrastructure that...  ...boundaries of scale. Our SRE team is the...  ...just maintain existing clusters; you will help define... 
    Remote work
    Senior
    Performance
    Work experience placement

    Luma AI

    United States
    4 days ago
  •  ...is seeking an Infrastructure/Cluster Engineer to design and operate...  ...large-scale clusters that enable AI inference at scale. The role...  ...Responsibilities include debugging performance issues and designing...  ...cluster health. Experience with GPU infrastructure is a plus.... 
    Performance

    Linuxcareers

    San Francisco, CA
    2 days ago
  • Jobgether is seeking a Senior Site Reliability Engineer (SRE) based in Germany, focusing on maintaining and improving cloud infrastructure reliability, scalability, and performance. You will enhance critical services in a fast-paced environment, ensuring smooth operations... 
    Senior
    Performance

    Jobgether

    New Bremen, OH
    4 days ago
