Staff Infrastructure Engineer, Cluster Infrastructure

$320k - $405k

United States Digital Space LLC

About the company

The company’s mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build beneficial AI systems.

About the role

The company's Infrastructure organization is foundational to our mission of developing AI systems that are reliable, interpretable, and steerable. The systems we build determine how quickly we can train new models, how reliably we can run safety experiments, and how effectively we can scale Claude to millions of users, demonstrating that safe, reliable infrastructure and frontier capabilities can go hand in hand.

Cluster Infra owns the full lifecycle of compute clusters at the company. We build agent-driven automation for cluster provisioning and lifecycle management across all major cloud providers and our own datacenters. Our systems stand up clusters that are interconnected with high bandwidth, secure-by-default, and able to automatically drain and recover in response to failure. As a Staff engineer on this team, you'll set the technical direction for how the company brings compute online, at a moment when the scale of that compute is growing faster than at almost any company in the world.

Key responsibilities

Own the technical strategy and roadmap for agent-driven cluster lifecycle management: provisioning, updates, and decommissioning
Partner across teams to ensure new compute capacity is ingested on time
Align with partner teams on physical build-out and leverage cloud solutions to deliver high-bandwidth inter-cluster connectivity
Collaborate with security owners to ensure clusters are provisioned secure-by-default
Define and drive strategy on cluster scalability, homogeneity, and fault tolerance
Work closely with cloud providers and internal research, inference, and product teams to shape long-term compute, data, and infrastructure strategy
Establish and evolve operational-excellence practices: incident response, postmortem culture, and on-call health
Support the growth of engineers around you through technical mentorship and coaching

Minimum qualifications

Deep expertise in distributed systems, reliability, and cloud platforms (e.g., Kubernetes, IaC, AWS/GCP/Azure)
Strong proficiency in at least one systems language (e.g., Rust, Go, or Python), IaC proficiency with Terraform
Track record of leading complex, multi-quarter technical initiatives spanning multiple teams or systems
Ability to build alignment across senior stakeholders and communicate effectively at all levels

Preferred qualifications

8+ years of software engineering experience, including time as a technical lead setting direction for a team
Experience operating large-scale compute infrastructure at hyperscale (100+ clusters, 10K+ nodes)
Depth in one or more of: Kubernetes internals, cluster provisioning and management systems, cluster orchestration systems (Mesos, Borg-like)
Experience with cloud networking: VPC design and peering, Shared VPC/Transit Gateway, Cloud Interconnect/Direct Connect, Cloud NAT, cross-cloud private connectivity, BGP and route control, edge load balancing and DDoS mitigation (Cloud Armor / AWS Shield)
Experience with cluster and host networking: CNI (Cilium), eBPF, NetworkPolicy, multi-NIC, sFlow, service mesh (Istio/Envoy/Linkerd, mTLS)
Experience with cluster security: pod security standards and admission control, RBAC and least-privilege IAM, node and container hardening, supply-chain/image provenance
Deep experience with infrastructure-as-code (Terraform, Atlantis), workflow orchestration (Temporal, Argo Workflows)
Skill in quickly understanding systems design tradeoffs and keeping track of rapidly evolving software systems

Compensation

$320,000 — $405,000 USD

Logistics

Minimum education: Bachelor’s degree or an equivalent combination of education, training, and/or experience
Required field of study: A field relevant to the role as demonstrated through coursework, training, or professional experience
Minimum years of experience: Years of experience required will correlate with the internal job level requirements for the position
Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time. However, some roles may require more time in our offices.
Visa sponsorship: We do sponsor visas! However, we aren't able to successfully sponsor visas for every role and every candidate. But if we make you an offer, we will make every reasonable effort to get you a visa, and we retain an immigration lawyer to help with this.

How we're different

We believe that the highest-impact AI research will be big science. At the company we work as a single cohesive team on just a few large-scale research efforts. And we value impact (advancing our long-term goals of steerable, trustworthy AI) rather than work on smaller and more specific puzzles. We view AI research as an empirical science, which has as much in common with physics and biology as with traditional efforts in computer science. We're an extremely collaborative group, and we host frequent research discussions to ensure that we are pursuing the highest-impact work at any given time. As such, we greatly value communication.

Vacancy posted 4 hours ago

