Lead Site Reliability Engineer

Intellum

Job Description

About us

Intellum is the leader in corporate education technology and powers the largest, most successful customer, partner, and employee learning programs in the world. Large brands and fast-moving companies like Google, Meta, Amazon, Walmart, Xero, Atlassian, Mailchimp, Airbnb, Stripe, and TikTok rely on Intellum to engage and educate the audiences they touch.

We have always been a "remote first" company and are proud to have team members located all over the world. We value Curiosity, Creativity, Perseverance, and Kindness and strive to demonstrate these core values every day.

Our culture is very important to us. We invest in our people in fun and exciting ways, including personal development budgets and an annual all-company retreat that is focused less on work and more on human connections. We are in growth mode, and our "smart growth" approach ensures that we will continue to scale our company effectively.

Summary

We are seeking a Lead Site Reliability Engineer to spearhead our SRE team. You are not just an operator; you are an experienced software engineer who excels at architecture, code optimization, and deep troubleshooting. In this role, you will drive operational maturity by defining our reliability standards (SLOs), hardening our security posture (WAF/InfraSec), and scaling the Intellum platform.

Our stack

Core: Applications written in Ruby on Rails and Node.js, PostgreSQL, MongoDB, Redis, Memcached, Sidekiq, ActiveJob, Elasticsearch, WebSockets
Infrastructure: 100% Linux-based cloud infrastructure (AWS, Google Cloud, MongoDB Atlas) and services (ECS/EC2/Kubernetes, ElastiCache, Memorystore, RDS, Cloud SQL, BigQuery, etc.)
Infrastructure as Code (IaC): GitHub, Terragrunt, Terraform, Ansible
CI/CD: Spinnaker, Jenkins
Observability & Alerting: New Relic, AWS CloudWatch, Google Cloud Stackdriver, Squadcast
Agile/Scrum practices utilizing Jira

Responsibilities

SRE Leadership & Strategy: Set clear goals for the SRE team and partner with Engineering leadership to align platform initiatives with business objectives.
Reliability & Observability (SLA/SLO): Lead the definition and enforcement of SLAs, SLIs, and SLOs. Architect observability frameworks to translate telemetry data into actionable roadmaps that reduce toil and enhance resilience.
Core Engineering & Performance: Take ownership of critical code components (i.e., Queues, Enrollments) and lead efforts to identify bottlenecks, optimize performance, and improve code quality across the engineering department.
Security by Design: Champion infrastructure security. Partner with InfoSec to define hardening standards, manage perimeter defense (WAF/DDoS), and automate vulnerability remediation within the CI/CD pipeline.
Incident Command: Participate in the 24x7 on-call rotation and lead post-incident reviews (RCAs), ensuring action items are implemented to improve MTTR and prevent recurrence.
Mentorship: Empower developers with better tooling and guidance on performant coding practices, fostering a culture of collaboration, reliability, and "you build it, you run it".

Required Skills

Experience & Engineering

10+ years of engineering experience, with 5+ years specifically developing Ruby on Rails applications.
Expertise in Cloud Computing (AWS/GCP) and Infrastructure as Code (Terraform/Ansible).
Strong proficiency with SQL databases (PostgreSQL) and the ability to quickly navigate and optimize complex, unfamiliar codebases.

SRE & Operations

Deep Observability: Proven experience designing monitoring solutions (Datadog, New Relic, Prometheus) based on the "Golden Signals".
SLO Governance: Demonstrated ability to define SLIs/SLOs from scratch, negotiate Error Budgets, and use data to balance feature velocity with reliability.
Security Focus: Experience securing cloud environments and container platforms (Kubernetes), including hands-on management of WAF rules and edge security.
Incident Management: Experience leading post-incident reviews (RCAs) and implementing action items that directly improve MTTR (Mean Time to Recovery) and MTTD (Mean Time to Detection).

Leadership

Proven experience leading technical teams, mentoring engineers, and working in a team-oriented, collaborative environment with strong communication skills.
Documentation & Training: Skilled in documenting solutions and training operational teams on how to effectively support and maintain systems.
Proactive Problem-Solving: Demonstrated ability to communicate clearly, seek help proactively, and take ownership of tasks, leading them to completion.

Bonus Skills

Automation Tools: Experience in developing solutions using server automation tools such as Terraform and Ansible.
CI/CD Expertise: Experience in writing and maintaining CI/CD pipelines and services.
Kubernetes: Experience in building, deploying, and optimizing Kubernetes-based infrastructure.
Perimeter Defense: Experience configuring and managing Web Application Firewalls (WAF) (e.g., Cloudflare, AWS WAF, Akamai) and DDoS protection mechanisms.

Education

Bachelor's degree in Computer Science or related technical field

Benefits

Medical: 100% of employee premiums for selected individual plans
Dental: 100% of employee premiums covered
Vision: 100% of employee premiums covered
LinkedIn Learning
401(k) plus matching (US Based Only)
Unlimited PTO
Calm subscription
Annual Company Retreat

Intellum is an equal-opportunity employer. We're committed to building an inclusive team that celebrates diversity in people, perspectives, and backgrounds regardless of race, color, national origin, gender, sexual orientation, age, religion, disability, citizenship, veteran status, or any other protected status.

We encourage you to apply for an open position, and if you have questions about whether or not your job experience and skill set meet the requirements for a specific role, reach out to us directly at View email address on click.appcast.io.

If you are an individual applying from CA, NY, CO, CT, MD, NV, or RI, please reach out to View email address on click.appcast.io to inquire about specific pay ranges.

Vacancy posted 8 hours ago
