Site Reliability Engineer

$131.76k - $161.06k

Lawrence Berkeley Lab

The National Energy Research Scientific Computing Center (NERSC) is hiring a Site Reliability Engineer to help ensure its HPC and data systems remain reliable, secure, and accessible for 11,000 scientific users. As part of a 24x7 operations team, you will use advanced monitoring and data systems to proactively maintain the health of NERSC's computing environment and support critical DOE scientific research.

We're here for the same mission, to bring science solutions to the world. Join our team and YOU will play a supporting role in our goal to address global challenges! Have a high level of impact and work for an organization associated with 17 Nobel Prizes!

Why join Berkeley Lab?

We invest in our employees by offering a total rewards package you can count on:

  • Exceptional health and retirement benefits, including pension or 401K-style plans

  • Opportunities to grow in your career - check out our Tuition Assistance Program

  • A culture where you'll belong - we are invested in our teams!

  • In addition to accruing vacation and sick time, we also have a Winter Holiday Shutdown every year.

  • Parental bonding leave (for both mothers and fathers)

  • Pet insurance

You will:

  • Work a 5-day schedule with 2-3 onsite operations shifts and 2-3 project days, rotating across day, swing, and overnight shifts as needed to monitor the NERSC HPC facility.

  • Monitor and respond to system, storage, network, and facility alerts, escalating issues when necessary.

  • Improve reliability through automation, process optimization, monitoring enhancements, and root-cause prevention.

  • Develop and maintain monitoring, alerting, and diagnostic tools, including integrations with HPC system APIs and ServiceNow.

  • Support 24/7 data collection and real-time diagnostics across critical infrastructure.

  • Contribute to Agentic AI solutions that automate workflows and improve operational efficiency.

  • Coordinate with NERSC teams on maintenance, workflows, and incident management.

  • Perform physical and logical data center inspections to ensure environmental and infrastructure health.

  • Maintain accurate incident and maintenance records in the ticketing system.

  • Analyze and resolve complex operational issues using sound technical judgment and collaboration with internal and external experts.

We are looking for:

  • Typically requires a minimum of 5 years of related experience with a Bachelor's degree; or 3 years and a Master's degree; or equivalent work experience.

  • Experience in or willingness to work within a 24/7 onsite team environment to support large-scale data centers or critical installations.

  • Experience with the Linux shell and working in a command-line (e.g., SSH) environment.

  • Experience developing tools in programming languages such as C, C++, Perl, Java, or Python, or in a scripting language, with knowledge of standard software development practices.

  • Motivated self-starter who can learn technologies that improve data center management in areas like Kubernetes, Prometheus/VictoriaMetrics, Alertmanager, building management software, evaporative cooling, and power utilization.

  • Experience with network security, including configuring and maintaining ACLs and knowledge of firewalls.

  • Experience collaborating across technical teams to resolve operational bottlenecks and ensure system reliability and alignment with service-level objectives.

  • Knowledge of and ability to work on large data communications networks, network protocols, and IT infrastructure supporting highly available systems and applications.

Desired skills/knowledge:

  • Experience with ServiceNow implementation is a plus, particularly in architecting or deploying solutions for Incident Management, Change Management, or CMDB to improve IT workflows.

  • Practical experience in developing and deploying Agentic AI or autonomous automation tools to streamline technical tasks.

  • Familiarity with ITSM best practices and an understanding of how to align service lifecycles with business goals is preferred.

  • A certification in system administration (platforms or software), or other advanced education in computing science.

  • ServiceNow certifications.

  • ITIL certifications.

Additional information:

  • Applications will be accepted until the job posting is removed.

  • Appointment type: This is a full-time, career appointment, exempt (monthly paid) from overtime pay.

  • Salary range: The expected salary for this position is $131,760 - $161,064, which fits into the full salary range of $117,132 - $197,676, depending upon the candidate's skills, knowledge, and abilities. This includes education, certifications, and years of experience.

  • Background check: This position is subject to a background check. Any convictions will be evaluated to determine if they directly relate to the responsibilities and requirements of the position. Having a conviction history will not automatically disqualify an applicant from being considered for employment.

  • Work modality: This position requires substantial on-site presence, but is eligible for a flexible work mode, and hybrid schedules may be considered. Hybrid work is a combination of performing work on-site at Lawrence Berkeley National Lab, 1 Cyclotron Road, Berkeley, CA and some telework. Individuals working a hybrid schedule must reside within 150 miles of Berkeley Lab. Work schedules are dependent on business needs. In rare cases, full-time telework or remote work modes may be considered.

Want to learn more about working at Berkeley Lab? Please visit: careers.lbl.gov

Equal Employment Opportunity Employer: The foundation of Berkeley Lab is our Stewardship Values: Team Science, Service, Trust, Innovation, and Respect; and we strive to build community with these shared values and commitments. Berkeley Lab is an Equal Opportunity Employer. We heartily welcome applications from all who could contribute to the Lab's mission of leading scientific discovery, excellence, and professionalism. In support of our rich global community, all qualified applicants will be considered for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, age, protected veteran status, or other protected categories under State and Federal law.

Misconduct Disclosure Requirement: As a condition of employment, the finalist will be required to disclose if they are subject to any final administrative or judicial decisions within the last seven years determining that they committed any misconduct, are currently being investigated for misconduct, left a position during an investigation for alleged misconduct, or have filed an appeal with a previous employer.

Vacancy posted 4 days ago