Site Reliability Engineer, Metal

$100k

Tenstorrent

Tenstorrent is leading the industry on cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. With AI redefining the computing paradigm, solutions must evolve to unify innovations in software models, compilers, platforms, networking, and semiconductors. Our diverse team of technologists has developed a high-performance RISC-V CPU from scratch, and shares a passion for AI and a deep desire to build the best AI platform possible. We value collaboration, curiosity, and a commitment to solving hard problems. We are growing our team and looking for contributors of all seniorities.

Tenstorrent is building large-scale AI systems across internal clusters and customer deployments. This role sits at the intersection of site reliability, infrastructure operations, and customer engineering, ensuring our systems are reliable, observable, and production-ready.

This role is hybrid, based out of Toronto, ON; Austin, TX; or Santa Clara, CA.

We welcome candidates at various experience levels for this role. During the interview process, candidates will be assessed for the appropriate level, and offers will align with that level, which may differ from the one in this posting.

Who You Are
  • Experienced in site reliability, infrastructure, or systems engineering in distributed environments.
  • Strong Linux systems knowledge with the ability to troubleshoot complex multi-layer issues.
  • Proficient with observability tools such as Prometheus, Grafana, and alerting systems.
  • Comfortable with scripting and automation using Python, Go, or similar languages.
  • Solid understanding of networking fundamentals and how systems behave at scale.

What We Need
  • Ensure reliability and operational health of Tenstorrent systems across internal and customer environments.
  • Troubleshoot complex issues across compute, networking, and software layers.
  • Partner with engineering teams and customers to resolve production incidents.
  • Design and improve monitoring, observability, and alerting systems.
  • Build automation to reduce operational toil and improve system reliability.

What You Will Learn
  • How large-scale AI infrastructure is operated across internal clusters and customer deployments.
  • How distributed systems behave under real-world production conditions.
  • How observability and automation drive reliability at scale.
  • How hardware, networking, and software systems interact in AI environments.
  • How customer-facing AI infrastructure is deployed, supported, and optimized.

Compensation for all engineers at Tenstorrent ranges from $100k to $500k, including base and variable compensation targets. Experience, skills, education, background, and location all impact the actual offer made.

Tenstorrent offers a highly competitive compensation package and benefits, and we are an equal opportunity employer.

Vacancy posted 1 day ago