Staff Engineer, Distributed Storage and HPC & AI Infrastructure
$250k - $300k
Together AI
San Francisco
About the Role
In this role, you will design and deliver multi-petabyte storage systems purpose-built for the world's largest AI training and inference workloads. You'll architect high-performance parallel filesystems and object stores, evaluate and integrate cutting-edge technologies such as WekaFS, Ceph, and Lustre, and drive aggressive cost optimization, routinely achieving 30-50% savings through intelligent tiering, lifecycle policies, capacity forecasting, and right-sizing.
You will also build Kubernetes-native storage operators and self-service platforms that provide automated provisioning, strict multi-tenancy, performance isolation, and quota enforcement at cluster scale. Day-to-day, you'll optimize end-to-end data paths for 10-50 GB/s per node, design multi-tier caching architectures, implement intelligent prefetching and model-weight distribution, and tune parallel filesystems for AI workloads.
Responsibilities
- Design multi-petabyte AI/ML storage systems; integrate WekaFS, Ceph, etc.; lead capacity planning and cost optimization (30-50% savings via tiering, lifecycle policies, right-sizing).
- Design/optimize RDMA, InfiniBand, 400GbE networks; tune for max throughput/min latency; implement NVMe-oF/iSCSI; troubleshoot bottlenecks; optimize TCP/IP for storage.
- Build Kubernetes storage operators/controllers; enable automated provisioning, self-service abstractions, multi-tenant isolation, quotas; create reusable Helm/Terraform patterns.
- Deliver 10-50 GB/s per GPU node; optimize caching (weights/datasets/checkpoints), parallel filesystems, and data paths; troubleshoot with profiling tools; scale to thousands of nodes.
- Build multi-tier caches (local NVMe, distributed, object); optimize data locality and model-weight distribution; implement smart prefetching/eviction.
- Implement monitoring, alerting, SLOs; design DR/backups with runbooks; run chaos engineering; ensure 99.9%+ uptime via proactive/automated remediation.
- Partner with ML/SRE teams; mentor on storage best practices; contribute to open-source; write docs, postmortems, and public learnings.
Requirements
- 8+ years in storage engineering with 3+ years managing distributed storage at multi-petabyte scale
- Proven track record deploying and operating high-performance storage for GPU/HPC clusters
- Deep Kubernetes and cloud-native storage experience in production environments
- Strong coding skills in Go and Python with demonstrated ability to build production-grade tools
- BS/MS in Computer Science, Engineering, or equivalent practical experience
- History of technical leadership: designing systems that significantly improved performance (>3x), reliability (99.9%+ uptime), or cost efficiency
- Distributed Storage Systems: Deep expertise in WekaFS, Lustre, GPFS, BeeGFS, or similar parallel filesystems at multi-petabyte scale
- Object Storage: Production experience with S3, MinIO, Ceph, or R2 including performance optimization and cost management
- Kubernetes Storage: CSI drivers, StatefulSets, PersistentVolumes, storage operators, and custom controllers
- Storage optimization for GPU workloads, RDMA/InfiniBand networking, parallel filesystem optimization (100+ GB/s aggregate cluster throughput)
- Programming: Go and Python for automation, operators, and tooling
- Infrastructure as Code: Terraform, Ansible, Helm, GitOps (ArgoCD)
- Linux Storage Stack: Advanced knowledge of filesystems (ext4, xfs), LVM, NVMe optimization, RAID configurations
- Observability: Prometheus, Grafana, Thanos architecture and operations
Nice to Have Skills
- GPU Direct Storage (GDS), NVMe-oF, storage networking (100GbE/400GbE)
- ML/AI storage patterns (model weights, checkpointing, dataset caching)
- Kubernetes operator development (controller-runtime, kubebuilder)
- Storage snapshots, cloning, and thin provisioning
- Backup and disaster recovery (Velero, Restic, cross-region replication)
- Storage encryption (at-rest and in-transit), security and compliance
- Storage benchmarking and profiling tools (fio, iperf3, iostat, blktrace)
About Together AI
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure.
Compensation
We offer competitive compensation, startup equity, health insurance, and other benefits, as well as flexibility in terms of remote work. The US base salary range for this full-time position is $250,000 - $300,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.
Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.
Please see our privacy policy at


