Lead Site Reliability Engineer Job Description

Our company is looking for a Lead Site Reliability Engineer to join our team.

Responsibilities:

  • Work with the production support team to adopt monitoring tools and processes;
  • Build deep, full-stack knowledge of our platforms and applications;
  • Serve as project manager or scrum master for major initiatives and train the team to be the first line of support;
  • Participate in design reviews and make recommendations to improve the reliability and maintainability of the system;
  • Mentor and manage teams of one to three people, providing technical guidance and expertise;
  • Ensure software has good logging and diagnostics;
  • Take ownership of many controls, processes, and risks required to maintain our compliance portfolio (SOC 2, PCI-DSS, GDPR, and HIPAA, among others);
  • Travel occasionally to meet with the rest of Lightbend’s technical team;
  • Participate in root cause analysis reviews of production issues and identify improvements to prevent recurrence;
  • Help triage and respond to incidents escalated to the Engineering team, including emergencies, escalating to the development team as needed;
  • Develop automation, processes and metrics to ensure maximum reliability and uptime for our customers;
  • Continuously improve observability to ensure the uptime and reliability of our applications and infrastructure;
  • Establish an on-call cadence with the team and ensure adequate coverage;
  • Proactively monitor and review application performance;
  • Help create and maintain an environment that provides security and privacy for our customers' data.

Requirements:

  • Development/automation experience with Python, Ansible, and Git preferred;
  • Experience with Office 365 automation;
  • 5+ years of experience with VMware-based virtualization (ESXi/vCenter);
  • Bachelor’s degree in Information Technology or related field with 10+ years’ IT experience, preferably in an IaaS, SaaS, or similar industry;
  • Highly motivated person with the ability to learn new technologies hands-on on an ongoing basis;
  • Expert-level knowledge of a hybrid cloud environment consisting of Azure Cloud (IaaS/PaaS) and VMware-based virtualization (ESXi/vCenter);
  • Enjoys collaborating with a wide variety of teams within and outside their domain;
  • Ability to build a team culture that aims for high service availability, scalability, and observability;
  • Experience with containers and Kubernetes;
  • Software development experience using Go, Python and Java;
  • At least 5 years of work experience in Site Reliability/Infrastructure Engineering for a team operating in public cloud;
  • Proven track record of designing, building, optimizing, and maintaining infrastructure on a large scale;
  • Experience with Kafka, MySQL, InfluxDB, Elasticsearch, Redis, and/or Memcached;
  • 5+ years of professional experience starting from a developer role and transitioning to an SRE role;
  • A passion for SRE/DevOps and running highly resilient/automated systems.