RDMA Ops Engineer - Computing Infrastructure Networking

$104.4k - $171k

Alibaba Cloud

Overview

We're seeking a skilled RDMA Ops Engineer to optimize and maintain high-performance networking infrastructure for our computing clusters. This role focuses on building and operating ultra-low latency, high-throughput networks using RDMA technologies to power next-generation computing workloads.

Responsibilities
  • Deploy, operate, and maintain RDMA-based network architectures (RoCE/InfiniBand) for clusters with thousands of nodes
  • Optimize network performance for distributed collective communication workloads (NCCL, MPI, etc.)
  • Solve complex network issues in distributed collective communication (e.g., NCCL/MPI communication bottlenecks)
  • Use automation tools for network provisioning, monitoring, diagnostics, and network performance profiling (latency/throughput analysis)
  • Implement CI/CD pipelines for network infrastructure-as-code
  • Manage end-to-end network lifecycle: deployment, configuration, monitoring, upgrades
  • Collaborate with computing algorithm engineers to troubleshoot network-related bottlenecks in training/inference pipelines
  • Bridge computing framework requirements with underlying network infrastructure capabilities
  • Ensure compliance with security and scalability requirements

Qualifications
  • Strong scripting skills (Python/Go/Bash) for operational automation
  • Expert-level RDMA operational experience (RoCEv2/InfiniBand)
  • Understanding of Linux internals (kernel bypass, syscall optimization, etc.) and proficiency in Linux network stack tuning (irqbalance, NUMA, hugepages)
  • Hands-on experience with RDMA/DPDK performance tuning
  • Strong knowledge of network protocols (TCP/IP, RoCEv2) and NIC architecture principles
  • Ability to abstract complex technical concepts into architectural diagrams
  • Proven track record of translating R&D innovations into production solutions
  • Strong communication skills for cross-functional collaboration with Computing researchers and SRE teams
  • Experience managing production computing networks
  • Familiarity with Kubernetes networking (CNI, Multus, SR-IOV) and GPU-aware scheduling
  • Background in computing system optimization (NVIDIA collective libraries, MPI tuning)
  • Deep understanding of computing workload patterns and their network implications

Compensation and Employment

The pay range for this position at commencement of employment is expected to be between $104,400 and $171,000/year. However, base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. If hired, the employee will be in an “at-will position” and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.

Vacancy posted 4 days ago
Similar jobs that could be interesting for you
Based on the RDMA Ops Engineer - Computing Infrastructure Networking in Sunnyvale, CA vacancy
  • $104.4k - $171k

     ...A leading cloud service provider is seeking an RDMA Ops Engineer to optimize high-performance networking infrastructure for computing clusters. Responsibilities include deploying RDMA-based network architectures and optimizing performance. The ideal candidate has strong... 
    Network

    Alibaba Cloud

    Sunnyvale, CA
    4 days ago
  • $198k - $326k

     ...Description LinkedIn is the world's largest professional network, built to create economic opportunity for every member...  ...business needs of the team. As a Sr. Staff Software Engineer of the Compute Infrastructure team at LinkedIn, you will play a crucial role in our... 
    Network
    For contractors
    Work at office
    Flexible hours

    LinkedIn

    Mountain View, CA
    5 days ago
  • $200k - $400k

     ...A dedicated research lab is seeking a Network Engineer to design and optimize low-latency, high-bandwidth networking solutions for AI supercomputing...  .... The ideal candidate has strong experience with NVIDIA RDMA technologies, networking protocols, and Kubernetes. This role... 
    Network

    Institute of Foundation Models

    Sunnyvale, CA
    4 days ago
  • $174k - $252k

    Senior Software Engineer, Google Cloud Compute Infrastructure Benefits for this role include: Health, dental, vision, life, disability insurance Retirement...  ..., distributed computing, large-scale system design, networking and data storage, security, artificial intelligence,... 
    Network
    Full time
    Temporary work
    Worldwide

    Reporter Newspapers

    Sunnyvale, CA
    3 days ago
  • $188k - $275k

     ...CoreWeave combines superior infrastructure performance with deep technical...  ...breakthroughs and turn compute into capability. Founded in...  ...What You'll Do: The Field Engineering organization at CoreWeave is...  ...-up (IT service, break-fix, network, and firmware), and standing... 
    Network
    Permanent employment
    Full time
    Contract work
    Temporary work
    Casual work
    Work at office
    Flexible hours

    CoreWeave

    Sunnyvale, CA
    21 hours ago
  • $136.8k - $359.72k

     ...Senior Software Engineer - Compute Infrastructure (Orchestration & Scheduling) Location: San Jose Team: Infrastructure Employment Type:...  ...across heterogeneous resources—including CPU, GPU, memory, network, and power across global data centers. Lead Infrastructure... 
    Network
    Temporary work
    Overseas

    ByteDance

    San Jose, CA
    4 days ago
  • $156k - $387.6k

     ...About the Team The Inference Infrastructure team is the creator and open...  ...part of ByteDance's Core Compute Infrastructure organization,...  ...workloads, and are looking for engineers passionate about cloud-...  ...systems, and/or high-performance networking systems. - Hands-on... 
    Network
    Temporary work
    Local area

    ByteDance

    San Jose, CA
    2 days ago
  •  ...Clara is looking for a Cloud Managed Services Engineer to provide end-to-end management and technical support for networking problems. The role involves diagnosing issues...  .... Ideal candidates should possess a BS in Computer Science with 8+ years of experience, expertise... 
    Network
    Flexible hours

    Versa Networks

    Santa Clara, CA
    1 day ago
  • $165k - $242k

     ...Join to apply for the Senior Platform Engineer II, Compute Services role at CoreWeave ....  ...enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise...  ...distributed systems. Knowledge of network protocols and distributed consensus... 
    Network
    Permanent employment
    Temporary work
    Casual work
    Work at office
    Remote work
    Flexible hours

    CoreWeave

    Sunnyvale, CA
    4 days ago
  • $165k - $242k

     ...Senior Platform Engineer II, Compute Services Livingston, NJ / New York, NY / Sunnyvale, CA...  ...enterprises, CoreWeave combines superior infrastructure performance with deep technical...  ...distributed systems. Knowledge of network protocols and distributed consensus algorithms... 
    Network
    Permanent employment
    Temporary work
    Casual work
    Work at office
    Flexible hours

    CoreWeave

    Sunnyvale, CA
    2 days ago
  •  ...Principal DevOps Engineer (Cloud Ops) Location: Palo Alto The Position...  ...world's largest business network community to connect and collaborate...  ...and maintaining Docker infrastructure for micro services...  ...is desired BS or MS in Computer Science or related field... 
    Network

    Netpace

    Palo Alto, CA
    2 days ago
  • $88.4k - $143k

     ...Technical Support Engineer, Cortex Cloud Compute Our Mission Being the cybersecurity partner of choice, protecting our digital way of life...  ...evaluation criteria for obtaining results. You’ll enjoy networking with key contacts outside your own area of expertise, with... 
    Network
    Shift work

    Palo Alto Networks

    Santa Clara, CA
    4 days ago
  • $100k

     ...Engineer, SoC Infrastructure Santa Clara, California, United States Tenstorrent is leading the...  ...efficiency. With AI redefining the computing paradigm, solutions must evolve to unify...  ...models, compilers, platforms, networking, and semiconductors. Our diverse team... 
    Network
    Permanent employment

    Tenstorrent

    Santa Clara, CA
    4 days ago
  • $184k - $287.5k

    Senior Software Engineer, DGX Cloud AI Infrastructure...  ..., distributed computing, and large-scale...  ...-scale AI clusters, infrastructure, and end-to-end...  ...across compute, memory, networking, and communication layers...  ...familiarity with the RDMA software stack (NCCL,... 
    Network
    Remote work

    NVIDIA Corporation

    Santa Clara, CA
    1 day ago
  • $175k - $210k

     ...Infrastructure Engineer Forward Networks is transforming how the world's most complex networks are managed and secured. Founded in 2...  ...have ample opportunities to learn. Both Dev and Ops work is in scope. Storage, Compute, Network, Cloud, or Application? If you said '... 
    Network
    Work experience placement
    Work at office
    2 days per week

    Forward Networks Inc

    Santa Clara, CA
    2 days ago
  •  ...Sr. Engineer, Performance Infrastructure Austin, Texas, United States Tenstorrent is leading the industry...  ...efficiency. With AI redefining the computing paradigm, solutions must evolve to...  ...models, compilers, platforms, networking, and semiconductors. Our diverse team... 
    Network

    Tenstorrent

    Santa Clara, CA
    2 days ago
  • $272k - $431.25k

    NVIDIA has been transforming computer graphics, PC gaming, and...  ...Principal Rack Scale Systems Infrastructure Engineer, you will build and guide...  ...firmware, OS lifecycle, and networking fabrics. Your task is to compose...  ...as Ethernet, InfiniBand, RDMA, and fabric‑level... 
    Network
    Shift work

    NVIDIA Corporation

    Santa Clara, CA
    19 hours ago
  • $150k - $275k

     ...A leading AI infrastructure company based in San Jose is seeking a highly skilled Supercomputing Engineer specialized in networking. This role involves developing high-performance networking...  ...strong C/C++ skills and experience with RDMA technologies. The position offers a... 
    Network
    Relocation package

    ETCHED LLC

    San Jose, CA
    4 days ago
  • $207k - $300k

    Staff Software Engineer, ML, Compute Platform Sunnyvale, CA, USA Advanced Experience owning outcomes...  .... Experience with diagnostics, networking and data analysis. About the job Google...  ..., image processing etc. The AI and Infrastructure team is redefining what’s possible.... 
    Network
    Full time
    Worldwide

    Google Inc.

    Sunnyvale, CA
    2 days ago
  • $200k - $400k

     ...data scientists, and engineers, tackling the most fundamental...  ...for high‑performance computing in deep learning,...  ..., high‑bandwidth networking solutions that power...  ...technologies such as NVIDIA’s RDMA‑capable solutions,...  ...pipelines through Infrastructure‑as‑Code (IaC) best... 
    Network
    Visa sponsorship

    Institute of Foundation Models

    Sunnyvale, CA
    4 days ago
  • $149.4k - $205.4k

     ...Staff HPC Infrastructure Engineer...  ...and improve the computational infrastructure. You...  ...work. Work with the networking infrastructure team...  ...experience. 2+ years of RDMA networking experience...  ...software release and ops processes and... 
    Network
    Work at office
    Remote work
    Work from home
    Flexible hours

    Guardant Health

    Palo Alto, CA
    4 days ago
  • Litmus is seeking an IT Systems Specialist for their Santa Clara HQ. The role involves managing the on-prem VMware infrastructure, office network operations, and IT support for employees. The successful candidate should have strong experience in VMware vSphere, networking... 
    Network
    Work at office

    Litmus

    Santa Clara, CA
    3 days ago
  •  ...Sr. Director Of Network Engineering At Oracle Cloud Infrastructure (OCI), we build the future of the cloud for Enterprises...  ...virtual network service teams compute and GPU product and engineering...  ...network, particularly in supporting RDMA interconnections. You will... 
    Network

    Oracle

    Santa Clara, CA
    3 days ago
  • $225k - $275k

     ...vertically integrated AI infrastructure company built from the ground...  ...time. The demand for AI compute is boundless, and power is...  ...is seeking a Senior Staff Network Deployment Engineer to serve as the technical...  ...Supply Chain, Data Center Ops, and Site Reliability leadership... 
    Network
    Temporary work
    Remote work

    Crusoe

    Sunnyvale, CA
    22 days ago
  • $141.91k - $200.34k

     ...Join an enthusiastic team of engineers in Intel's Networking Solutions Group (NSG)...  ...next generation programmable Infrastructure Processing Units (IPUs)...  ...Master's in Electrical or Computer engineering, Computer Science...  ...data center workloads, RDMA, collectives, and AI benchmarking... 
    Network
    Local area
    Immediate start
    Shift work

    Intel

    Santa Clara, CA
    3 days ago
  • $150k - $230k

     ...and veteran systems engineers who share a vision for...  ...foundations of distributed computing. As AI workloads grow...  ...complex, traditional infrastructure struggles to meet the...  ...systems, high-speed networking, and distributed...  ...performance networking (RDMA, InfiniBand) ML... 
    Network

    Clockwork Inc

    Palo Alto, CA
    4 days ago
  • $109.2k - $223.4k

     ...Principal Network Engineer We are the AI Infrastructure - Network Operations team at OCI. We support and operate the RDMA/RoCE network fabrics for OCI's largest AI and HPC customers....  ...of a large-scale global Oracle cloud computing environment (Oracle Cloud Infrastructure... 
    Network
    Temporary work
    Immediate start
    Flexible hours

    Oracle

    Santa Clara, CA
    4 days ago
  • $160.36k - $240.54k

     ...Software Engineer, ML Infrastructure Mountain View, California (HQ) Who We Are Nuro is a self...  ...engineers with seamless access to compute and data resources. You will be responsible...  ...of distributed systems, networking, and storage bottlenecks in the context... 
    Network

    Nuro

    Mountain View, CA
    2 days ago
  •  ...on AI solutions is seeking an experienced QA Engineer to test products across various platforms...  ...candidate must have a Bachelor's degree in Computer Science and at least 5 years of hands-on testing experience in networking technologies. Strong communication and debugging... 
    Network

    Nexthop Systems Inc

    Santa Clara, CA
    4 days ago
  • $94.16k - $141k

     ...building blocks of the data infrastructure that connects our world....  ...scale up and scale out networking, disaggregated memory, storage...  ...~ Master's degree in Computer Science, Computer Engineering, Electrical Engineering,...  ...protocols, including TCP/IP and RDMA Preferred... 
    Network
    Internship

    Marvell

    Santa Clara, CA
    4 days ago
