Site Reliability Engineering Manager Job Description

Our company is looking for a Site Reliability Engineering Manager to join our team.

Responsibilities:

  • Compile KPIs and evangelize best practices for performance and reliability across the organization;
  • Embed in SRE projects and on-call rotations to keep your skills sharp and stay close to operational workflows and issues;
  • Promote a healthy and functional work environment;
  • Maintain project and operational workload statistics;
  • Provide a solid foundation for building and maintaining successful SRE teams.

Requirements:

  • Exposure to Cloud, SaaS, and virtualization concepts and performance concerns;
  • Experience with open source stream-processing frameworks and systems, e.g. Kafka or Spark;
  • Exposure to application threading and concurrency concerns;
  • Experience with a configuration management system such as Puppet, Chef, or Salt;
  • Working knowledge of operating system design, processes, and threading models;
  • Knowledge of defining and monitoring system quality measures, including SLOs and SLAs;
  • Experience building tooling to improve system reliability, automate remediation of issues, or improve scalability;
  • Experience with different flavors of Linux, e.g. Red Hat, Ubuntu, or CentOS;
  • Hands-on experience collecting and analyzing performance data, troubleshooting, and tuning;
  • Experience delivering software designed for high concurrency, scalability, or availability;
  • Experience leading high-performing engineering teams;
  • Software development experience using Go;
  • Experience with containers and container orchestration tools (Docker, Kubernetes, and Spinnaker experience preferred);
  • Proven track record of designing, building, optimizing, and maintaining infrastructure on a large scale;
  • Experience with Kafka, MySQL, InfluxDB, Elasticsearch, Redis, and/or Memcached.