Site Reliability Engineering Manager Job Description Template
Our company is looking for a Site Reliability Engineering Manager to join our team.
Responsibilities:
- Compile KPIs and promote the adoption of performance and reliability best practices across the organization;
- Embed into SRE projects and on-call rotations to keep your skills sharp and stay close to the operational workflows and issues;
- Promote a healthy and functional work environment;
- Maintain project and operational workload statistics;
- Provide a solid foundation for building and maintaining successful SRE teams.
Requirements:
- Exposure to Cloud, SaaS, and virtualization concepts and performance concerns;
- Experience with open source stream-processing frameworks and systems, e.g. Kafka and Spark;
- Exposure to application threading and concurrency concerns;
- Experience with a configuration management system such as Puppet, Chef, or Salt;
- Working knowledge of operating system design, processes, and threading models;
- Knowledge of defining and monitoring system quality measures, including SLOs and SLAs;
- Experience building tooling to improve system reliability, automate remediation of issues, or improve scalability;
- Experience with different flavors of Linux, e.g. Red Hat, Ubuntu, and CentOS;
- Hands-on experience collecting and analyzing performance data, troubleshooting, and tuning;
- Experience delivering software designed for high concurrency, scalability, or availability;
- Experience leading high-performing engineering teams;
- Software development experience using Go;
- Experience with containers and container orchestration tools (Docker, Kubernetes, and Spinnaker experience preferred);
- Proven track record of designing, building, optimizing, and maintaining infrastructure on a large scale;
- Experience with Kafka, MySQL, InfluxDB, Elasticsearch, Redis, and/or Memcached.