Senior Site Reliability Engineer Job Description

Our company is looking for a Senior Site Reliability Engineer to join our team.

Responsibilities:

  • Implement automation tools and frameworks;
  • Continuously refine monitoring processes, configurations, and thresholds;
  • Practice sustainable incident response and blameless postmortems;
  • Work from our SF office or remotely from within the United States;
  • Participate in a rotating on-call schedule to troubleshoot and resolve production escalations from our 24x7x365 NOC;
  • Build tools that help Operations teams quickly pinpoint, isolate and resolve issues related to infrastructure, platform services and applications;
  • Monitor, maintain and help scale services that are integrated into S&P’s platform;
  • Develop playbooks and tools to streamline processes and shorten problem resolution time;
  • Automate all the things;
  • Monitor and optimize application performance within the deployment architecture;
  • Write code that improves scalability, performance, maintainability and security;
  • Add, tune and maintain alert configurations and documentation as needed;
  • Operate in a high-pressure environment and troubleshoot complex issues quickly while handling multiple priorities;
  • Cultivate full-team participation in building high-quality, thoughtful software;
  • Learn or increase your expertise in coding – we use Python.

Requirements:

  • Scripting languages like Ruby, Groovy, Bash, PowerShell or Python;
  • Object-oriented software development in Java, Scala, etc.;
  • NoSQL databases (e.g., Couchbase, Cassandra);
  • Programming expertise in either Python or Ruby, with demonstrated knowledge of software engineering best practices (e.g., linting, testing);
  • Experience programming with Python/Java, and/or the ability and interest to learn, is required;
  • Experience with infrastructure such as GCP, AWS and MySQL;
  • Knowledge of best practices and IT operations for an always-up, always-available service;
  • Solid experience with SQL and with Postgres or a similar RDBMS;
  • 5+ years of experience working in operations;
  • Expertise in scalable testing, automation, continuous integration frameworks and best practices;
  • BS Degree in Computer Science, Electrical & Computer Engineering or Mathematics or equivalent experience;
  • Experience in SDLC, distributed systems, networking, hardware, logistics and operations or capacity planning;
  • Strong background in Linux/Unix Administration;
  • 5+ years of experience with Windows and/or Linux operating systems internals and administration (e.g., filesystems, inodes, system calls);
  • Experience with algorithms, data structures, complexity analysis and software design.