Site Reliability Engineer
SRM Digital LLC
Required Technical / Functional Skills
- 7+ years of experience in Site Reliability Engineering (SRE), Platform Engineering, Cloud Infrastructure Engineering, or related roles within large-scale enterprise environments.
- Minimum of 4 years of hands-on experience working primarily within Microsoft Azure cloud environments.
- Strong expertise in Azure Kubernetes Service (AKS), including cluster lifecycle management, RBAC, network security policies, pod security standards, autoscaling, workload identity, and platform governance.
- Proven experience building and supporting microservices-based applications using Java and implementing CI/CD pipelines using Azure DevOps (ADO).
- Hands-on experience designing, implementing, and operating enterprise-scale observability solutions using Dynatrace.
- Strong understanding of, and practical experience establishing, Service Level Objectives (SLOs), Service Level Indicators (SLIs), Error Budgets, and reliability-focused operational practices.
- Strong scripting and automation experience using Python, PowerShell, Azure Automation, and cloud-native tooling.
Roles & Responsibilities
Reliability Engineering & Platform Ownership
- Define, establish, and continuously improve enterprise-wide reliability standards, including SLOs, SLIs, and Error Budgets across business-critical Azure-hosted services.
- Own service reliability metrics and regularly communicate SLA compliance, operational health, and reliability improvements to business and executive stakeholders.
- Partner with architecture, development, and platform teams to ensure reliability, scalability, and resiliency requirements are embedded throughout the service lifecycle.
- Conduct architecture and design reviews to ensure availability targets, resilience requirements, and recovery objectives are incorporated from initial design through production deployment.
- Drive adoption of reliability engineering best practices and champion proactive resilience initiatives including chaos engineering methodologies.
Incident Management & Operational Excellence
- Lead major incident management activities by serving as Incident Commander for high-priority production incidents (P1/P2) and driving resolution efforts across cross-functional teams.
- Own the end-to-end incident lifecycle including detection, escalation, communication, resolution management, and post-incident reviews.
- Participate in structured global on-call rotations and maintain operational response objectives for mission-critical services.
- Foster a blameless postmortem culture focused on continuous improvement and ensure corrective actions are tracked through completion.
Disaster Recovery & Resiliency
- Design, implement, and maintain Disaster Recovery (DR) strategies across Azure environments to ensure business continuity and operational resilience.
- Lead regular disaster recovery exercises, validate recovery processes, and continuously improve recovery readiness across critical workloads.
- Establish and maintain Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) aligned with business requirements.
Observability & Monitoring
- Design, build, and operate enterprise observability capabilities using Dynatrace to provide comprehensive visibility across Metrics, Events, Logs, and Traces (MELT).
- Develop monitoring standards, dashboards, alerting frameworks, and operational reporting to improve service visibility and reduce incident response times.
- Integrate monitoring and alerting platforms with enterprise tools including PagerDuty and ServiceNow to enable proactive operations.
Automation & Platform Engineering
- Build automation frameworks, operational tooling, self-healing capabilities, and reusable platform services to improve operational efficiency and reduce manual effort.
- Develop and maintain infrastructure automation, operational runbooks, and platform engineering capabilities using Azure-native services and scripting technologies.
- Continuously identify opportunities to improve reliability, scalability, security, and operational efficiency through automation and platform enhancements.
Vacancy posted more than 2 months ago
