Site Reliability Engineer

SRM Digital LLC

Required Technical / Functional Skills

  • 7+ years of experience in Site Reliability Engineering (SRE), Platform Engineering, Cloud Infrastructure Engineering, or related roles within large-scale enterprise environments.
  • Minimum of 4 years of hands-on experience working primarily within Microsoft Azure cloud environments.
  • Strong expertise in Azure Kubernetes Service (AKS), including cluster lifecycle management, RBAC, network security policies, pod security standards, autoscaling, workload identity, and platform governance.
  • Proven experience building and supporting microservices-based applications using Java and implementing CI/CD pipelines using Azure DevOps (ADO).
  • Hands-on experience designing, implementing, and operating enterprise-scale observability solutions using Dynatrace.
  • Strong understanding and practical experience establishing Service Level Objectives (SLOs), Service Level Indicators (SLIs), Error Budgets, and reliability-focused operational practices.
  • Strong scripting and automation experience using Python, PowerShell, Azure Automation, and cloud-native tooling.

Roles & Responsibilities

Reliability Engineering & Platform Ownership

  • Define, establish, and continuously improve enterprise-wide reliability standards, including SLOs, SLIs, and Error Budgets across business-critical Azure-hosted services.
  • Own service reliability metrics and regularly communicate SLA compliance, operational health, and reliability improvements to business and executive stakeholders.
  • Partner with architecture, development, and platform teams to ensure reliability, scalability, and resiliency requirements are embedded throughout the service lifecycle.
  • Conduct architecture and design reviews to ensure availability targets, resilience requirements, and recovery objectives are incorporated from initial design through production deployment.
  • Drive adoption of reliability engineering best practices and champion proactive resilience initiatives, including chaos engineering.

Incident Management & Operational Excellence

  • Lead major incident management activities by serving as Incident Commander for high-priority production incidents (P1/P2) and driving resolution efforts across cross-functional teams.
  • Own the end-to-end incident lifecycle including detection, escalation, communication, resolution management, and post-incident reviews.
  • Participate in structured global on-call rotations and maintain operational response objectives for mission-critical services.
  • Foster a blameless postmortem culture focused on continuous improvement and ensure corrective actions are tracked through completion.

Disaster Recovery & Resiliency

  • Design, implement, and maintain Disaster Recovery (DR) strategies across Azure environments to ensure business continuity and operational resilience.
  • Lead regular disaster recovery exercises, validate recovery processes, and continuously improve recovery readiness across critical workloads.
  • Establish and maintain Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) aligned with business requirements.

Observability & Monitoring

  • Design, build, and operate enterprise observability capabilities using Dynatrace to provide comprehensive visibility across Metrics, Events, Logs, and Traces (MELT).
  • Develop monitoring standards, dashboards, alerting frameworks, and operational reporting to improve service visibility and reduce incident response times.
  • Integrate monitoring and alerting platforms with enterprise tools including PagerDuty and ServiceNow to enable proactive operations.

Automation & Platform Engineering

  • Build automation frameworks, operational tooling, self-healing capabilities, and reusable platform services to improve operational efficiency and reduce manual effort.
  • Develop and maintain infrastructure automation, operational runbooks, and platform engineering capabilities using Azure-native services and scripting technologies.
  • Continuously identify opportunities to improve reliability, scalability, security, and operational efficiency through automation and platform enhancements.

Vacancy posted more than 2 months ago
