Reliability Lead, Common Services

$206k - $303k
Full-time

CoreWeave

CoreWeave is The Essential Cloud for AI™. Built for pioneers by pioneers, CoreWeave delivers a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence. Trusted by leading AI labs, startups, and global enterprises, CoreWeave combines superior infrastructure performance with deep technical expertise to accelerate breakthroughs and turn compute into capability. Founded in 2017, CoreWeave became a publicly traded company (Nasdaq: CRWV) in March 2025.

WHAT YOU’LL DO:

The Common Services organization at CoreWeave is responsible for the shared platforms, APIs, and foundational services that power our AI cloud products and internal engineering teams. From authentication and authorization to core platform primitives and developer experience tooling, this organization ensures that the rest of CoreWeave can build, ship, and operate reliably at scale. As Reliability Lead, Common Services, you will establish and lead the Reliability Engineering and production operations practice for this organization. You’ll partner closely with engineering leaders and teams across Common Services to define how we build, release, monitor, and operate critical services—raising the bar on reliability, availability, and operational excellence across the board.

ABOUT THE ROLE:

As Reliability Lead, Common Services, you will be responsible for defining the reliability strategy, processes, and standards for the Common Services portfolio and driving consistent, high-quality operational practices across multiple teams. You’ll monitor production incidents within Common Services, and work directly with your partner teams to design systems that are reliable, observable, and supportable. Your day-to-day will blend hands-on technical work and cross-functional leadership to drive continuous improvement of Common Services production operations.

In this role, you will:
  • Establish and lead the SRE / production engineering practice for the Common Services organization, including standards for reliability, incident management, and on-call, in partnership with the central Product Engineering organization.
  • Develop an Operational Excellence strategy that focuses not only on improving system performance but also on monitoring and reducing operational toil.
  • Partner with engineering and product teams to define SLOs, SLIs, and error budgets for critical Common Services, and ensure these become part of how teams plan and make tradeoffs.
  • Own and improve the incident management lifecycle for Common Services, including on-call rotations, escalation paths, incident tooling, post-incident reviews, and follow-through on corrective actions.
  • Drive the observability strategy (metrics, logs, traces, dashboards, alerts) for Common Services, ensuring we have actionable visibility into the health, performance, and capacity of key systems.
  • Collaborate with engineering leads to design and review architectures for reliability, scalability, resilience, and operability, including failure modes, redundancy, and graceful degradation.
  • Lead efforts to automate and harden operational workflows, including deployments, rollbacks, configuration management, change management, and routine maintenance tasks.
  • Build strong, trust-based relationships with partner teams and stakeholders, becoming a go-to leader for production readiness and operational risk within Common Services.
  • Hire, mentor, and develop SRE and production engineering talent, fostering a culture of continuous improvement, learning from incidents, and humane on-call.
  • Partner with other SRE and production engineering leaders across CoreWeave to align on global practices, tools, and reliability goals, representing the needs and constraints of Common Services.

WHO YOU ARE:

  • 7+ years of experience in Site Reliability Engineering, Production Engineering, or similar roles working on distributed systems or cloud/platform services.
  • 2+ years of technical leadership experience (team lead, staff/principal engineer, or people manager) where you drove reliability and operational improvements across multiple services or teams.
  • Strong background in Linux-based production environments, containers, and orchestration technologies (e.g., Kubernetes), including debugging complex issues in live systems.
  • Hands-on experience with observability stacks (metrics, logging, tracing) and alerting systems, and a track record of designing meaningful SLIs/SLOs and alert strategies.
  • Proven experience running on-call rotations and incident response, including leading high-severity incidents and driving high-quality post-incident reviews.
  • Demonstrated ability to design for reliability (capacity planning, redundancy, failover, backoff, circuit breaking, graceful degradation, etc.) in large-scale or mission-critical systems.
  • Comfortable working with infrastructure-as-code and automation tooling (e.g., Terraform, Ansible, Helm, CI/CD pipelines) to make operations repeatable, auditable, and safe.
  • Strong cross-functional communication skills: you can translate between engineering, product, and business stakeholders and influence without relying solely on authority.
  • A bias toward data-driven decision making, using production data, capacity signals, and incident trends to inform priorities and investments.

PREFERRED:

  • Background working with GPU workloads, high-performance computing, or latency/throughput-sensitive systems.
  • Experience with multi-tenant, multi-region, or highly regulated environments, and the associated reliability considerations.
  • Familiarity with service ownership models and strong opinions on how to align ownership, on-call, and accountability in a scalable way.
  • Experience mentoring or managing senior engineers and building high-performing teams through coaching, feedback, and clear expectations.

WONDERING IF YOU’RE A GOOD FIT?

We believe in investing in our people, and value candidates who can bring their own diversified experiences to our teams, even if you aren't a 100% skill or experience match. Here are a few qualities we’ve found compatible with our team. If some of this describes you, we’d love to talk.
  • You care deeply about operational excellence and see reliability as a product feature, not an afterthought.
  • You’re excited by the challenge of bringing order and clarity to complex, rapidly evolving systems.
  • You’re passionate about building humane, sustainable on-call practices and learning from incidents without blame.
  • You enjoy partnering with multiple teams, influencing through context and clarity rather than authority alone.
  • You’re curious about how to run large-scale, GPU-intensive workloads reliably and efficiently in production.

WHY COREWEAVE?

At CoreWeave, we work hard, have fun, and move fast! We’re in an exciting stage of hyper-growth that you will not want to miss out on. We’re not afraid of a little chaos, and we’re constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values:
  • Be Curious at Your Core
  • Act Like an Owner
  • Empower Employees
  • Deliver Best-in-Class Client Experiences
  • Achieve More Together
We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and provides the opportunity to develop innovative solutions to complex problems. As we get set for takeoff, the growth opportunities within the organization are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too. Come join us!

The base salary range for this role is $206,000 to $303,000. The starting salary will be determined based on job-related knowledge, skills, experience, and market location. We strive for both market alignment and internal equity when determining compensation. In addition to base salary, our total rewards package includes a discretionary bonus, equity awards, and a comprehensive benefits program (all based on eligibility).

What We Offer

The range we’ve posted represents the typical compensation range for this role. To determine actual compensation, we review the market rate for each candidate, which can include a variety of factors. These include qualifications, experience, interview performance, and location. In addition to a competitive salary, we offer a variety of benefits to support your needs. The benefits below reflect our US-based offerings; for roles in other locations, benefits vary and are shared during the hiring process. These include:
  • Medical, dental, and vision insurance - 100% paid for by CoreWeave
  • Company-paid Life Insurance
  • Voluntary supplemental life insurance
  • Short and long-term disability insurance
  • Flexible Spending Account
  • Health Savings Account
  • Tuition Reimbursement
  • Ability to Participate in Employee Stock Purchase Program (ESPP)
  • Mental Wellness Benefits through Spring Health
  • Family-Forming support provided by Carrot
  • Paid Parental Leave
  • Flexible, full-service childcare support with Kinside
  • 401(k) with a generous employer match
  • Flexible PTO
  • Catered lunch each day in our office and data center locations
  • A casual work environment
  • A work culture focused on innovative disruption

California Applicants

California Consumer Privacy Act

Equal Opportunity & Accommodations

CoreWeave is an equal opportunity employer, committed to fostering an inclusive and supportive workplace. All qualified applicants and candidates will receive consideration for employment without regard to race, color, religion, sex, disability, age, sexual orientation, gender identity, national origin, veteran status, or genetic information. As part of this commitment and consistent with the Americans with Disabilities Act (ADA), CoreWeave will ensure that qualified applicants and candidates with disabilities are provided reasonable accommodations for the hiring process, unless such accommodation would cause an undue hardship. If reasonable accommodation is needed, please contact: View email address on click.appcast.io.

Export Control Compliance

This position requires access to export controlled information. To conform to U.S. Government export regulations applicable to that information, applicant must either be (A) a U.S. person, defined as a (i) U.S. citizen or national, (ii) U.S. lawful permanent resident (green card holder), (iii) refugee under 8 U.S.C. § 1157, or (iv) asylee under 8 U.S.C. § 1158, (B) eligible to access the export controlled information without a required export authorization, or (C) eligible and reasonably likely to obtain the required export authorization from the applicable U.S. government agency. CoreWeave may, for legitimate business reasons, decline to pursue any export licensing process.

Vacancy posted 1 day ago