Site Reliability Engineer Job Description

Site Reliability Engineer Job Description Template

Our company is looking for a Site Reliability Engineer to join our team.

Responsibilities:

  • Work within a highly skilled team of engineers to deliver revolutionary improvements to the cloud and scale them;
  • Drive efficiencies through software improvements and root cause analysis that improve service delivery, maturity, and scalability;
  • Apply knowledge of cloud platforms (AWS/Azure);
  • Practice sustainable incident response and blameless postmortems;
  • Take action to get our high-availability (HA) production environments to “just work” without manual intervention or midnight alerts;
  • Create and maintain operational documentation and runbooks;
  • Understand complex technical details and build test methodologies for them;
  • Track defects and broken links;
  • Cultivate full-team participation in building high-quality, thoughtful software;
  • Add, tune, and maintain alert configurations and documentation as needed;
  • Help define Dave’s best practices and teach your teammates how to use them moving forward;
  • Monitor, maintain, and help scale services that are integrated into S&P’s platform;
  • Assist with the implementation and integration of AWS services into the CMT cloud service infrastructure to enhance scalability and robustness;
  • Engage with other Engineering and Product teams to improve reliability, performance, availability and security of the Coupa Cloud;
  • Evangelize the adoption of best practices in relation to performance and reliability.

Requirements:

  • Understanding of CI/CD pipeline fundamentals;
  • Bachelor’s Degree in Computer Science, Computer Engineering or a closely related field;
  • An understanding and passion for testing, architecture, and observability;
  • 2+ years of experience with UNIX/Linux operating systems;
  • Experience with source control tooling, such as TFS or Git, in a team environment;
  • Demonstrated proficiency in at least one of the following programming languages: Python, Java, Golang;
  • Experience working with large scale production deployments of thousands of servers;
  • Experience with Linux, no matter the flavor;
  • Experience designing, debugging, and running fault-tolerant, large-scale distributed systems;
  • Experience with AWS CloudFormation;
  • Master’s degree in Computer Science or a related field;
  • Ability to assist in the development and management of Infrastructure as Code (IaC) processes;
  • Knowledge of cloud and virtualization technology;
  • Knowledge of web technologies (e.g., JBoss, Tomcat, Apache Server, WebSphere).