Senior GPU Infrastructure & Automation Engineer
Fal
You are a hands-on engineer who builds the software and processes that keep a large fleet of GPU servers healthy and productive. You write systems and tooling for managing thousands of servers, including provisioning, health monitoring, error detection, and recovery. When something breaks that automation can’t fix, you drive resolution with partners.

Key responsibilities
- Build and maintain a Python fleet-tracking system that manages the full lifecycle of servers, including contracting and procurement, target use, pricing, availability, health, RMAs, etc.
- Build server management tooling that automates provisioning, health checks, GPU diagnostics, recovery, and alerting
- Create and maintain metrics, dashboards, and alerting for hardware health across the fleet (GPU errors, disk failures, network issues, thermals)
- Leverage AI heavily to build tools and automate alerting and recovery
- Implement and enforce OS-level security: hardening baselines, SELinux/AppArmor policies, SSH key management, vulnerability scanning, and compliance automation
- Manage and optimize distributed and local storage systems supporting model weights, checkpoints, and ephemeral scratch: NVMe arrays, NFS, parallel file systems, and object storage
- Tune Linux systems for AI workloads: kernel parameters, NUMA topology, CPU pinning, hugepages, I/O schedulers, and GPU driver stack optimization (NVIDIA drivers, CUDA, container runtimes)
- Develop a suite of automated error detection and recovery processes
- Work with partners to solve technical issues

Requirements
- 5+ years of experience managing bare-metal and VM server fleets at scale (100+ nodes)
- Strong software engineering skills in Python; you write production tooling, not scripts
- Deep Linux systems knowledge: boot process, kernel tuning, networking, storage, systemd, cgroups, namespaces, performance profiling
- Strong experience with configuration management and infrastructure-as-code: Ansible, Terraform, cloud-init
- Solid understanding of storage technologies: LVM, RAID, NVMe, NFS, Lustre or GPFS, and Linux I/O stack tuning
- Familiarity with hardware diagnostics and failure modes (GPUs, NVMe, NICs, memory)
- Experience building internal tools or dashboards for infrastructure visibility
- Excellent communication and ability to drive technical decisions across teams
- Self-starter who executes quickly, takes ownership, and constantly seeks improvement

Nice to have
- Familiarity with network configuration and diagnostics (VLAN, VXLAN, ECMP, BGP, tcpdump)
- Experience with NVIDIA GPU infrastructure: driver management, health monitoring, DCGM, NVLink/NVSwitch diagnostics, RDMA, InfiniBand/RoCEv2
- Experience with AMD GPUs
- Experience with bare-metal and VM provisioning (PXE/iPXE, Kickstart, libvirt, QEMU/KVM)
- Experience with compliance frameworks relevant to cloud providers (SOC 2, ISO 27001)

Location
Turkey

What we offer at fal
- Interesting and challenging work
- A lot of learning and growth opportunities

