HPC/ML Infrastructure Engineer
Spellbrush
Experienced HPC Infrastructure Engineer
We're looking for an experienced HPC infrastructure engineer to lead bringup, administration, and operations on what is probably the largest anime AI training cluster in the world. You'll serve as the bridge between our researchers and the bare GPU machines, helping to make sure that SLURM jobs are running, parallel filesystems are serving, the network is transmitting, and the anime models are training.
Love for Anime and the Anime Aesthetic
This is probably one of the only jobs in the world where you will get to combine your love of anime and large-scale GPU systems.
Familiarity with the Modern HPC Software Landscape
Once upon a time, our team could install SLURM on a few bare metal nodes and get away with it. Now the landscape has become unbelievably complex, with SLURM deploys through Slinky on K8s, provisioning through Warewulf/MAAS/Ansible, filesystems through WEKA/VAST/Ceph, VPN and access through Tailscale, and monitoring via the Grafana/Prometheus stack. We're looking for someone with relevant experience up and down the stack (and maybe a papercut or two to show for it!).
Traditional Sysadmin Skills
Bringing up and managing a cluster still requires good old Linux sysadmin skills, including wrangling LDAP, triaging dmesg, and setting sticky bits on directories for misbehaving users and tools.
Comfort with Physical Computers
We're building out edge datacenters and our CEO is still personally racking, stacking, and provisioning HGX-based nodes in our living room. Also his VLAN design sucks and he's bad at fiber routing. Please send help.
Working on Small, Fast-Paced Teams
We currently have a very tiny research team, and you'll be directly helping our AI researchers train the best anime image model in the world. We also believe in the unmatched speed of in-person teams, and prefer on-site collaboration in either our primary research office in Tokyo (downtown Akihabara) or San Francisco (Dogpatch!). The Bay Area is strongly preferred, as we have physical hardware there. Visa sponsorships are available.


