Lead Site Reliability Engineer Job Description Template
Our company is looking for a Lead Site Reliability Engineer to join our team.
Responsibilities:
- Work with the production support team to adopt monitoring tools and processes;
- Build deep, full-stack knowledge of our platforms and applications;
- Serve as project manager or scrum master for major initiatives and train the team to be the first line of support;
- Participate in design reviews and make recommendations to improve the reliability and maintainability of the system;
- Mentor and manage teams of 1-3 people, providing technical guidance and expertise;
- Ensure software has good logging and diagnostics;
- Take ownership of many controls, processes, and risks required to maintain our compliance portfolio (SOC 2, PCI-DSS, GDPR, and HIPAA, among others);
- Travel occasionally to meet with the rest of the technical team;
- Participate in root cause analysis reviews of production issues and identify improvements to prevent recurrence;
- Help triage and respond to incidents escalated to the Engineering team, including emergencies, escalating to the development team as needed;
- Develop automation, processes and metrics to ensure maximum reliability and uptime for our customers;
- Continuously improve observability to ensure the uptime and reliability of our applications and infrastructure;
- Establish an on-call cadence with the team and ensure adequate coverage;
- Proactively monitor and review application performance;
- Help create and maintain an environment that provides security and privacy for our customers' data.
Requirements:
- Development/automation experience with Python, Ansible, and Git preferred;
- Experience with Office 365 automation;
- 5+ years of experience with VMware-based virtualization (ESXi/vCenter);
- Bachelor’s degree in Information Technology or related field with 10+ years’ IT experience, preferably in an IaaS, SaaS, or similar industry;
- Highly motivated person with the ability to learn new technologies hands-on, on an ongoing basis;
- Expert-level knowledge of a hybrid cloud environment consisting of Azure Cloud (IaaS/PaaS) and VMware-based virtualization (ESXi/vCenter);
- Enjoys collaborating with a wide variety of teams within and outside their domain;
- Ability to build a team culture that aims for high service availability, scalability, and observability goals;
- Experience with containers and Kubernetes;
- Software development experience using Go, Python and Java;
- At least 5 years of work experience in Site Reliability/Infrastructure Engineering for a team operating in public cloud;
- Proven track record of designing, building, optimizing, and maintaining infrastructure on a large scale;
- Experience with Kafka, MySQL, InfluxDB, Elasticsearch, Redis, and/or Memcached;
- 5+ years of professional experience, starting in a developer role and transitioning to an SRE role;
- A passion for SRE/DevOps and running highly resilient/automated systems.