Senior Production Engineer, Operational Excellence

Crusoe

Job Description

Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack, from electrons to tokens, to power the world's most ambitious AI workloads.

When you join Crusoe, you join a team that is building the future, faster. We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.

We're looking for problem-solving, opportunity-finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved: people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.

If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high-performing team that believes in each other, come build with us at Crusoe.

About This Role:

Crusoe is building the most reliable, energy-efficient, AI-optimized cloud platform, and Production Engineering sits at the heart of that mission. As a Production Engineer focused on Operational Excellence, you will help ensure the reliability, scalability, and performance of Crusoe's GPU cloud that powers next-generation AI workloads.

This role is ideal for engineers who enjoy solving complex production problems, improving large-scale distributed systems, and building automation that keeps infrastructure running smoothly. You'll play a key role in strengthening the operational foundation of Crusoe's cloud while helping scale infrastructure that supports demanding AI and HPC workloads.

You'll partner closely with Production Engineers, infrastructure teams, and platform engineers to improve system reliability, reduce operational toil, and drive continuous improvements across Crusoe's rapidly growing GPU cloud.

What You'll Be Working On:
  • Collaborate with cross-functional teams to define and evolve availability metrics for Crusoe's cloud platform, including establishing, measuring, and improving SLIs and SLOs
  • Participate in production incident response, diagnosing and resolving service disruptions while contributing to post-incident reviews and root cause analysis
  • Build, operate, and improve observability across Crusoe's infrastructure using tools such as Prometheus, Grafana, Alertmanager, and OpenTelemetry
  • Identify reliability risks, performance bottlenecks, and early indicators of potential production issues across distributed systems
  • Develop automation and tooling that reduces operational toil, improves recovery times, and enables self-healing infrastructure
  • Partner with compute, networking, storage, and platform teams to strengthen service resilience and disaster recovery capabilities
  • Contribute to improving operational processes, knowledge sharing, and reliability best practices across the engineering organization
  • Continue growing technical depth through mentorship, training, and hands-on work operating large-scale AI infrastructure

What You'll Bring to the Team:
  • 5+ years of experience in Production Engineering, SRE, or large-scale infrastructure operations
  • Experience supporting GPU workloads, HPC environments, or latency/throughput-sensitive distributed systems
  • Strong knowledge of Linux/Unix systems, including debugging complex issues across kernel and user space
  • Previous experience in Infrastructure roles building or managing compute, storage, or networking platforms
  • Understanding of modern cloud infrastructure fundamentals including Kubernetes, distributed systems, virtualization, and cloud platforms (AWS/GCP)
  • Familiarity with incident management practices and reliability frameworks (SRE, ITIL, or similar)
  • Experience with monitoring and observability tools such as Prometheus and Grafana, or a strong desire to deepen expertise in this area
  • Familiarity with infrastructure-as-code and configuration management tools such as Terraform or Ansible
  • Scripting or programming experience with languages such as Go, Python, C, or C++
  • Strong communication skills and the ability to collaborate across engineering teams
  • Ability to remain calm and effective while troubleshooting complex issues in high-impact production environments
  • A growth mindset and strong interest in reliability engineering, automation, and operational excellence

Bonus Points:
  • Experience working with Kubernetes or container orchestration platforms at scale
  • Exposure to change management processes, operational readiness reviews, or structured root cause analysis
  • Experience designing self-healing systems, automated remediation, or event-driven operational tooling
  • Interest in scaling AI or HPC infrastructure and solving reliability challenges in GPU-heavy environments
  • Passion for mentorship, learning, and developing deeper expertise in Production Engineering

Benefits:
  • Industry competitive pay
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company paid commuter benefit; $300 per month

Compensation:

Compensation will be paid in the range of $172,000 – $209,000 + Bonus. Restricted Stock Units are included in all offers. Compensation will be determined by the applicant's education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.

Vacancy posted 16 hours ago