Staff Engineer, Distributed Storage and HPC & AI Infrastructure

$250k - $300k

Together AI

San Francisco

About the Role

In this role, you will design and deliver multi-petabyte storage systems purpose-built for the world's largest AI training and inference workloads. You'll architect high-performance parallel filesystems and object stores, evaluate and integrate cutting-edge technologies such as WekaFS, Ceph, and Lustre, and drive aggressive cost optimization, routinely achieving 30-50% savings through intelligent tiering, lifecycle policies, capacity forecasting, and right-sizing.

You will also build Kubernetes-native storage operators and self-service platforms that provide automated provisioning, strict multi-tenancy, performance isolation, and quota enforcement at cluster scale. Day-to-day, you'll optimize end-to-end data paths for 10-50 GB/s per node, design multi-tier caching architectures, implement intelligent prefetching and model-weight distribution, and tune parallel filesystems for AI workloads.

Responsibilities

Design multi-petabyte AI/ML storage systems; integrate WekaFS, Ceph, etc.; lead capacity planning and cost optimization (30-50% savings via tiering, lifecycle policies, right-sizing).
Design/optimize RDMA, InfiniBand, 400GbE networks; tune for max throughput/min latency; implement NVMe-oF/iSCSI; troubleshoot bottlenecks; optimize TCP/IP for storage.
Build Kubernetes storage operators/controllers; enable automated provisioning, self-service abstractions, multi-tenant isolation, quotas; create reusable Helm/Terraform patterns.
Deliver 10-50 GB/s per GPU node; optimize caching (weights/datasets/checkpoints), parallel filesystems, and data paths; troubleshoot with profiling tools; scale to thousands of nodes.
Build multi-tier caches (local NVMe, distributed, object); optimize data locality and model-weight distribution; implement smart prefetching/eviction.
Implement monitoring, alerting, SLOs; design DR/backups with runbooks; run chaos engineering; ensure 99.9%+ uptime via proactive/automated remediation.
Partner with ML/SRE teams; mentor on storage best practices; contribute to open-source; write docs, postmortems, and public learnings.

Requirements

8+ years in storage engineering with 3+ years managing distributed storage at multi-petabyte scale
Proven track record deploying and operating high-performance storage for GPU/HPC clusters
Deep Kubernetes and cloud-native storage experience in production environments
Strong coding skills in Go and Python with demonstrated ability to build production-grade tools
BS/MS in Computer Science, Engineering, or equivalent practical experience
History of technical leadership: designing systems that significantly improved performance (>3x), reliability (99.9%+ uptime), or cost efficiency
Distributed Storage Systems: Deep expertise in WekaFS, Lustre, GPFS, BeeGFS, or similar parallel filesystems at multi-petabyte scale
Object Storage: Production experience with S3, MinIO, Ceph, or R2 including performance optimization and cost management
Kubernetes Storage: CSI drivers, StatefulSets, PersistentVolumes, storage operators, and custom controllers
Storage optimization for GPU workloads, RDMA/InfiniBand networking, parallel filesystem optimization (100+ GB/s aggregate cluster throughput)
Programming: Go and Python for automation, operators, and tooling
Infrastructure as Code: Terraform, Ansible, Helm, GitOps (ArgoCD)
Linux Storage Stack: Advanced knowledge of filesystems (ext4, xfs), LVM, NVMe optimization, RAID configurations
Observability: Prometheus, Grafana, Thanos architecture and operations

Nice to Have Skills

GPU Direct Storage (GDS), NVMe-oF, storage networking (100GbE/400GbE)
ML/AI storage patterns (model weights, checkpointing, dataset caching)
Kubernetes operator development (controller-runtime, kubebuilder)
Storage snapshots, cloning, and thin provisioning
Backup and disaster recovery (Velero, Restic, cross-region replication)
Storage encryption (at-rest and in-transit), security and compliance
Storage benchmarking and profiling tools (fio, iperf3, iostat, blktrace)

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is $250,000 - $300,000, plus equity and benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at

Vacancy posted 2 days ago
