Site Reliability Engineer, Metal

$100k

Tenstorrent

Tenstorrent is leading the industry in cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. With AI redefining the computing paradigm, solutions must evolve to unify innovations in software models, compilers, platforms, networking, and semiconductors. Our diverse team of technologists has developed a high-performance RISC-V CPU from scratch and shares a passion for AI and a deep desire to build the best AI platform possible. We value collaboration, curiosity, and a commitment to solving hard problems. We are growing our team and looking for contributors of all seniorities.

Tenstorrent is building large-scale AI systems across internal clusters and customer deployments. This role sits at the intersection of site reliability, infrastructure operations, and customer engineering, ensuring our systems are reliable, observable, and production-ready.

This role is hybrid, based out of Toronto, ON; Austin, TX; or Santa Clara, CA.

We welcome candidates at various experience levels for this role. During the interview process, candidates will be assessed for the appropriate level, and offers will align with that level, which may differ from the one in this posting.

Who You Are

Experienced in site reliability, infrastructure, or systems engineering in distributed environments.
Strong Linux systems knowledge with the ability to troubleshoot complex multi-layer issues.
Proficient with observability tools such as Prometheus, Grafana, and alerting systems.
Comfortable with scripting and automation using Python, Go, or similar languages.
Solid understanding of networking fundamentals and how systems behave at scale.

What We Need

Ensure reliability and operational health of Tenstorrent systems across internal and customer environments.
Troubleshoot complex issues across compute, networking, and software layers.
Partner with engineering teams and customers to resolve production incidents.
Design and improve monitoring, observability, and alerting systems.
Build automation to reduce operational toil and improve system reliability.

What You Will Learn

How large-scale AI infrastructure is operated across internal clusters and customer deployments.
How distributed systems behave under real-world production conditions.
How observability and automation drive reliability at scale.
How hardware, networking, and software systems interact in AI environments.
How customer-facing AI infrastructure is deployed, supported, and optimized.

Compensation for all engineers at Tenstorrent ranges from $100k to $500k, including base and variable compensation targets. Experience, skills, education, background, and location all impact the actual offer made.

Tenstorrent offers a highly competitive compensation package and benefits, and we are an equal opportunity employer.

Vacancy posted 1 day ago
