Staff Engineer, Distributed Storage and HPC & AI Infrastructure

$250k - $300k

Together AI

San Francisco

About the Role

In this role, you will design and deliver multi-petabyte storage systems purpose-built for the world's largest AI training and inference workloads. You'll architect high-performance parallel filesystems and object stores, evaluate and integrate cutting-edge technologies such as WekaFS, Ceph, and Lustre, and drive aggressive cost optimization, routinely achieving 30-50% savings through intelligent tiering, lifecycle policies, capacity forecasting, and right-sizing.

You will also build Kubernetes-native storage operators and self-service platforms that provide automated provisioning, strict multi-tenancy, performance isolation, and quota enforcement at cluster scale. Day-to-day, you'll optimize end-to-end data paths for 10-50 GB/s per node, design multi-tier caching architectures, implement intelligent prefetching and model-weight distribution, and tune parallel filesystems for AI workloads.

Responsibilities

  • Design multi-petabyte AI/ML storage systems; integrate WekaFS, Ceph, etc.; lead capacity planning and cost optimization (30-50% savings via tiering, lifecycle policies, right-sizing).
  • Design/optimize RDMA, InfiniBand, 400GbE networks; tune for max throughput/min latency; implement NVMe-oF/iSCSI; troubleshoot bottlenecks; optimize TCP/IP for storage.
  • Build Kubernetes storage operators/controllers; enable automated provisioning, self-service abstractions, multi-tenant isolation, quotas; create reusable Helm/Terraform patterns.
  • Deliver 10-50 GB/s per GPU node; optimize caching (weights/datasets/checkpoints), parallel filesystems, and data paths; troubleshoot with profiling tools; scale to thousands of nodes.
  • Build multi-tier caches (local NVMe, distributed, object); optimize data locality and model-weight distribution; implement smart prefetching/eviction.
  • Implement monitoring, alerting, SLOs; design DR/backups with runbooks; run chaos engineering; ensure 99.9%+ uptime via proactive/automated remediation.
  • Partner with ML/SRE teams; mentor on storage best practices; contribute to open-source; write docs, postmortems, and public learnings.

Requirements

  • 8+ years in storage engineering with 3+ years managing distributed storage at multi-petabyte scale
  • Proven track record deploying and operating high-performance storage for GPU/HPC clusters
  • Deep Kubernetes and cloud-native storage experience in production environments
  • Strong coding skills in Go and Python with demonstrated ability to build production-grade tools
  • BS/MS in Computer Science, Engineering, or equivalent practical experience
  • History of technical leadership: designing systems that significantly improved performance (>3x), reliability (99.9%+ uptime), or cost efficiency
  • Distributed Storage Systems: Deep expertise in WekaFS, Lustre, GPFS, BeeGFS, or similar parallel filesystems at multi-petabyte scale
  • Object Storage: Production experience with S3, MinIO, Ceph, or R2 including performance optimization and cost management
  • Kubernetes Storage: CSI drivers, StatefulSets, PersistentVolumes, storage operators, and custom controllers
  • Storage optimization for GPU workloads, RDMA/InfiniBand networking, parallel filesystem optimization (100+ GB/s aggregate cluster throughput)
  • Programming: Go and Python for automation, operators, and tooling
  • Infrastructure as Code: Terraform, Ansible, Helm, GitOps (ArgoCD)
  • Linux Storage Stack: Advanced knowledge of filesystems (ext4, xfs), LVM, NVMe optimization, RAID configurations
  • Observability: Prometheus, Grafana, Thanos architecture and operations

Nice to Have Skills

  • GPU Direct Storage (GDS), NVMe-oF, storage networking (100GbE/400GbE)
  • ML/AI storage patterns (model weights, checkpointing, dataset caching)
  • Kubernetes operator development (controller-runtime, kubebuilder)
  • Storage snapshots, cloning, and thin provisioning
  • Backup and disaster recovery (Velero, Restic, cross-region replication)
  • Storage encryption (at-rest and in-transit), security and compliance
  • Storage benchmarking and profiling tools (fio, iperf3, iostat, blktrace)

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is: $250,000 - $300,000 + equity + benefits. Our salary ranges are determined by location, level and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at

Vacancy posted 2 days ago