Senior Systems Software Engineer, GPU Compute
Nebius
About Nebius: Nebius is leading a new era in cloud infrastructure for the global AI economy. We are building a full-stack AI cloud platform that supports developers and enterprises from data and model training through to production deployment, without the cost and complexity of building large in-house AI/ML infrastructure. Built by engineers, for engineers. From large-scale GPU orchestration to inference optimization, we own the hard problems across compute, storage, networking and applied AI. Listed on Nasdaq (NBIS) and headquartered in Amsterdam, we have a global footprint with R&D hubs across Europe, the UK, North America and Israel. Our team of 1,500+ includes hundreds of engineers with deep expertise across hardware, software and AI R&D.
THE ROLE
We’re looking for a Senior Systems Software Engineer to join our team and play a key role in the development of our cutting-edge hyperscaler platform. The GPU & InfiniBand team is responsible for enhancing and optimizing the core components of our Cloud platform, with a specific focus on GPU computing, InfiniBand networks, and the KVM/QEMU stack. You’ll work closely with hardware virtualization and device emulation technologies, ensuring high performance and security in multi-GPU, HPC environments. The role involves analyzing, troubleshooting, and improving infrastructure to support new hardware, fine-tuning system performance, and automating fault detection and resolution in a complex system.
In this position, you will be responsible for:
- Tuning the performance of GPU clusters and InfiniBand networks to ensure optimal operation in HPC and GPU-based environments.
- Analyzing and troubleshooting the root cause of issues related to GPUs and InfiniBand networks, and proposing corrective actions.
- Integrating new hardware into the existing infrastructure, including support for new GPU hardware through software stacks like Kubernetes, QEMU, and KVM.
- Enhancing automation systems for proactively monitoring, detecting, and resolving issues in GPU and InfiniBand environments.
- Configuring and managing GPU devices and InfiniBand fabrics, ensuring efficient and reliable operation.
We expect you to have:
- 5+ years of professional experience in system-level software development (focused on performance optimization and low-level programming).
- 3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning).
- In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems.
- Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python).
It would be a plus if you have:
- Experience with GPU end-to-end testing in a cluster environment using InfiniBand networking.
- Proven track record of analyzing and optimizing the performance of HPC workloads (e.g., simulations, data analysis, AI/ML workloads).
- Familiarity with RDMA, RoCE, and InfiniBand protocols for high-performance communication.
- Background in Software-Defined Networking (SDN) and experience with HPC cluster networking.
- Understanding of QEMU/KVM virtualization and managing virtualized environments.
- Experience with deep learning frameworks such as PyTorch and TensorFlow, and their integration with HPC systems.
- Familiarity with collective communication libraries like MPI and NCCL for distributed computing.
We offer competitive salaries ranging from $170k-$300k + equity based on your experience. We conduct coding interviews as part of the process.
Benefits & Perks:
- Competitive compensation
- Career growth and learning opportunities
- Flexibility and work-life balance
- Collaborative and innovative culture
- Opportunity to work on impactful AI projects
- International environment and talented teams
