Site Reliability Engineering Manager Job Description

Site Reliability Engineering Manager Job Description Template

Our company is looking for a Site Reliability Engineering Manager to join our team.

Compile KPIs and promote the adoption of performance and reliability best practices across the organization;
Embed into SRE projects and on-call rotations to keep your skills sharp and stay close to the operational workflows and issues;
Promote a healthy and functional work environment;
Maintain project and operational workload statistics;
Provide a solid foundation for building and maintaining successful SRE teams.

Exposure to Cloud, SaaS, and virtualization concepts and performance concerns;
Experience with open source stream-processing frameworks or systems, e.g. Kafka or Spark;
Exposure to application threading and concurrency concerns;
Experience with a configuration management system such as Puppet, Chef, or Salt;
Working knowledge of operating system design, processes, and threading models;
Knowledge of defining and monitoring system quality measures, including SLOs and SLAs;
Experience building tooling to improve system reliability, automate remediation of issues, or improve scalability;
Experience with different flavors of Linux, e.g. Red Hat, Ubuntu, or CentOS;
Hands-on experience collecting and analyzing performance data, troubleshooting, and tuning;
Experience delivering software designed for high concurrency, scalability, or availability;
Experience leading high-performing engineering teams;
Software development experience using Go;
Experience with containers and container orchestration tools (Docker, Kubernetes, and Spinnaker experience preferred);
Proven track record of designing, building, optimizing, and maintaining infrastructure on a large scale;
Experience with Kafka, MySQL, InfluxDB, Elasticsearch, Redis, and/or Memcached.