Lead Site Reliability Engineer Job Description

Our company is looking for a Lead Site Reliability Engineer to join our team.

Work with the production support team to adopt monitoring tools and processes;
Build deep, full-stack knowledge of our platforms and applications;
Serve as project manager or scrum master for major initiatives and train the team to be the first line of support;
Participate in design reviews and make recommendations to improve the reliability and maintainability of the system;
Mentor and manage teams of one to three people, providing technical guidance and expertise;
Ensure software has good logging and diagnostics;
Take ownership of many controls, processes, and risks required to maintain our compliance portfolio (SOC 2, PCI-DSS, GDPR, and HIPAA, among others);
Travel occasionally to meet with the rest of Lightbend’s technical team;
Participate in root cause analysis reviews of production issues and identify improvements that prevent them from recurring;
Help triage and respond to incidents, including emergencies, that are escalated to the Engineering team, escalating further to the development team as needed;
Develop automation, processes and metrics to ensure maximum reliability and uptime for our customers;
Continuously improve observability to ensure the uptime and reliability of our applications and infrastructure;
Establish an on-call cadence with the team and ensure adequate coverage across all areas;
Proactively monitor and review application performance;
Help create and maintain an environment that provides security and privacy for our customers' data.

Development and automation experience with Python, Ansible, and Git preferred;
Experience with Office 365 automation;
5+ years of experience with VMware-based virtualization (ESXi/vCenter);
Bachelor’s degree in Information Technology or a related field with 10+ years of IT experience, preferably in an IaaS, SaaS, or similar industry;
Highly motivated, with the ability to learn new technologies hands-on on an ongoing basis;
Expert-level knowledge of a hybrid cloud environment consisting of Azure Cloud (IaaS/PaaS) and VMware-based virtualization (ESXi/vCenter);
Enjoys collaborating with a wide variety of teams within and outside their own domain;
Ability to build a team culture that aims for high service availability, scalability, and observability;
Experience with containers and Kubernetes;
Software development experience using Go, Python and Java;
At least 5 years of work experience in Site Reliability/Infrastructure Engineering for a team operating in public cloud;
Proven track record of designing, building, optimizing, and maintaining infrastructure on a large scale;
Experience with Kafka, MySQL, InfluxDB, Elasticsearch, Redis, and/or Memcached;
5+ years of professional experience, starting in a developer role and transitioning to an SRE role;
A passion for SRE/DevOps and running highly resilient/automated systems.