System Engineer, GPU Fleet
$200k - $300k
Fluidstack
As a System Engineer, GPU Fleet, you will manage, operate, and optimize hyperscale GPU compute infrastructure supporting AI/ML training and inference workloads. You will ensure high availability, performance, and reliability of the GPU server fleet through automation, monitoring, troubleshooting, and collaboration with hardware engineering, platform teams, and datacenter operations.
Focus
- Operate and maintain a large-scale GPU server fleet (H100, B200, GB200) supporting AI/ML workloads; monitor system health, performance, and utilization to maximize uptime and ensure SLA compliance
- Perform hands-on troubleshooting and root cause analysis of complex hardware, firmware, OS, and application issues across GPU clusters; coordinate with vendors and hardware teams to resolve systemic failures
- Develop and maintain automation scripts for provisioning, configuration management, monitoring, and remediation at scale
- Build and improve tooling for GPU health checks, performance diagnostics, driver validation, and automated recovery
- Execute server provisioning, configuration, firmware updates, and OS installation using automation frameworks; manage lifecycle operations including deployment, maintenance, and decommissioning
- Participate in a 24x7 on-call rotation; respond to production incidents and coordinate resolution with cross-functional teams including datacenter operations, network engineering, and application teams
- Lead post-incident reviews, document root causes, and drive continuous improvement initiatives focused on automation, reliability, monitoring, and operational efficiency
Basic Qualifications
- Bachelor's degree in Computer Science, Engineering, or related technical field (or equivalent practical experience)
- 3+ years (System Engineer) or 5+ years (Senior System Engineer) of experience in Linux system administration, datacenter operations, or infrastructure engineering
- Strong Linux/Unix fundamentals including system administration, scripting (Bash, Python), troubleshooting, and performance tuning
- Experience with server hardware architecture, troubleshooting techniques, and understanding of compute, memory, storage, and networking components
- Experience with automation and configuration management tools (Ansible, Puppet, Chef, Terraform)
- Strong analytical and problem-solving skills with ability to diagnose complex technical issues under pressure
- Excellent communication and collaboration skills; ability to work effectively with cross-functional teams
Preferred Qualifications
- Experience managing large-scale GPU infrastructure (NVIDIA H100, A100, B200, GB200) in production environments supporting AI/ML workloads
- Deep knowledge of GPU architecture, CUDA toolkit, GPU drivers, monitoring tools (nvidia-smi, DCGM)
- Experience with HPC cluster management, job schedulers (Slurm, PBS, LSF), and container orchestration (Kubernetes, Docker)
- Proficiency in out-of-band management via the BMC (IPMI, Redfish) and firmware management for server hardware
- Experience with high-performance networking (InfiniBand, RoCE, RDMA) and network troubleshooting in GPU cluster environments
- Familiarity with datacenter operations including rack installations, cabling, power management, and thermal considerations
Salary & Benefits
- Competitive total compensation package (salary + equity).
- Retirement or pension plan, in line with local norms.
- Health, dental, and vision insurance.
- Generous PTO policy, in line with local norms.
The base salary range for this position is $200,000 - $300,000 per year, depending on experience, skills, qualifications, and location. This range represents our good faith estimate of the compensation for this role at the time of posting. Total compensation may also include equity in the form of stock options.
We are committed to pay equity and transparency.
Fluidstack is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans' status, or any other characteristic protected by law. Fluidstack will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.


