HPC/ML Infrastructure Engineer

Spellbrush

Experienced HPC Infrastructure Engineer

We're looking for an experienced HPC infrastructure engineer to lead bringup, administration, and operations on what is probably the largest anime AI training cluster in the world. You'll serve as the bridge between our researchers and the bare GPU machines, helping to make sure that SLURM jobs are running, parallel filesystems are serving, the network is transmitting, and the anime models are training.

Love for Anime and the Anime Aesthetic

This is probably one of the only jobs in the world where you will get to combine your love of anime and large-scale GPU systems.

Familiarity with the Modern HPC Software Landscape

Once upon a time, our team could install SLURM on a few bare metal nodes and get away with it. Now the landscape has become unbelievably complex, with SLURM deploys through Slinky on K8s, provisioning through Warewulf/MAAS/Ansible, filesystems through WEKA/VAST/Ceph, VPN and access through Tailscale, and monitoring via the Grafana/Prometheus stack. We're looking for someone with relevant experience up and down the stack (and maybe a papercut or two to show for it!).

Traditional Sysadmin Skills

Bringing up and managing a cluster still requires good old Linux sysadmin skills, including wrangling LDAP, triaging dmesg, and setting sticky bits on directories for misbehaving users and tools.

Comfort with Physical Computers

We're building out edge datacenters and our CEO is still personally racking, stacking, and provisioning HGX-based nodes in our living room. Also his VLAN design sucks and he's bad at fiber routing. Please send help.

Working on Small, Fast-Paced Teams

We currently have a very tiny research team, and you'll be directly helping some of the best AI researchers in the world train the best anime image model in the world. We also believe in the unmatched speed of in-person teams, and prefer on-site collaboration in either our primary research office in Tokyo (downtown Akihabara) or San Francisco (Dogpatch!). The Bay Area is strongly preferred, as we have physical hardware there. Visa sponsorships are available.

Vacancy posted 1 day ago