Remote Senior Site Reliability Engineer

GrabJobs

Job description

What are we building?
Hard Rock Digital is a team focused on becoming the best online sportsbook, casino, and social gaming company in the world. We’re building a team with a passion for learning, operating, and building new products and technologies for millions of consumers. We care about each customer interaction, experience, behavior, and insight and strive to ensure we’re always acting authentically.

Rooted in the kindred spirits of Hard Rock and the Seminole Tribe of Florida, Hard Rock Digital taps a brand known the world over as the leader in gaming, entertainment, and hospitality. We’re taking that foundation of success and bringing it to the digital space - ready to join us?

What’s the position?
We are looking for a Senior Site Reliability Engineer who combines deep infrastructure expertise with a forward-thinking approach to AI-driven operations. In this role you will maintain and improve the reliability, scalability, and performance of our Java-based applications while pioneering the use of large language models (LLMs), agentic workflows, and intelligent automation to transform how we monitor, respond to, and prevent incidents.

You will design and build autonomous and semi-autonomous AI agents that consume observability data, triage alerts, generate runbooks, automate incident response steps, and surface actionable insights—reducing toil and accelerating mean time to resolution. This is a hands-on engineering role for someone who is equally comfortable tuning a JVM, writing PromQL, and prototyping an agentic pipeline with tool-calling LLMs.

Key Responsibilities
Application Reliability & Performance
Ensure the availability, reliability, and performance of high-traffic Java-based applications in a distributed environment.

Troubleshoot and resolve complex issues across production and non-production environments.

Participate in pre- and post-deployment performance testing and monitoring to continuously improve application performance.

Optimize Java application performance with a focus on JVM tuning, efficient resource utilization, and horizontal scaling.

Monitoring, Observability & AIOps
Deploy and manage the Grafana stack (Grafana, Prometheus, Loki, Mimir, Alloy) to deliver real-time monitoring, logging, and alerting.

Implement and refine observability strategies that enhance visibility into application and infrastructure health.

Create and maintain dashboards, alerts, and log queries for comprehensive system health monitoring.

Integrate AI/ML models into the observability pipeline for anomaly detection, predictive alerting, and intelligent alert correlation and noise reduction.

AI & Agentic Workflow Engineering
Design, build, and operate agentic AI workflows that automate operational tasks such as alert triage, root cause analysis, runbook execution, and incident summarization.

Develop tool-calling LLM agents that interact with infrastructure APIs (Kubernetes, Grafana, Jira, Slack, PagerDuty) to execute diagnostic and remediation actions autonomously or with human-in-the-loop approval.

Build and maintain MCP (Model Context Protocol) servers and integrations that expose internal systems as tool surfaces for AI agents.

Evaluate, select, and operationalize LLM frameworks and orchestration platforms (e.g., LangChain, LangGraph, CrewAI, n8n, or custom solutions) for production-grade agentic systems.

Implement guardrails, evaluation harnesses, and feedback loops to ensure AI agent outputs are accurate, safe, and continuously improving.

Champion the adoption of AI-assisted development and operations practices across the SRE and broader engineering organization.

Incident Management & Root Cause Analysis
Support the operations team’s incident response efforts, conduct post-mortems, and identify root causes to prevent recurrence.

Leverage AI tools to accelerate incident timelines, auto-generate post-mortem drafts, and surface patterns across historical incidents.

Document and share lessons learned, contributing to a culture of continuous improvement.

Automation & Toil Reduction
Identify repetitive operational workflows and engineer AI-augmented or fully automated replacements.

Build self-service tools and chatbot interfaces that allow engineering teams to query system status, retrieve logs, and execute standard operating procedures through natural language.

Measure and report on toil reduction metrics to quantify the impact of automation initiatives.

Collaboration & Cross-functional Support
Work closely with developers, architects, and data/ML engineers to design solutions that improve reliability and leverage AI capabilities.

Collaborate with DevOps and NOC teams to support the application platform.

Communicate SRE practices, AI/automation capabilities, and operational insights to technical and non-technical stakeholders.

Provide feedback on application performance, potential improvements, and observability metrics.

Why This Role Is Different
This is not a traditional SRE position with AI bolted on as an afterthought. We are building a team that treats AI and agentic automation as core competencies—on par with Kubernetes expertise or observability design. You will have the autonomy to experiment with cutting-edge AI tools, the backing of leadership to deploy them in production, and a mandate to measurably reduce operational toil through intelligent systems.

Job requirements

What are we looking for?
Core SRE & Infrastructure (Required)
Degree in Computer Science or a related field, or equivalent professional experience.

5+ years in SRE, DevOps, or similar infrastructure roles with experience managing large-scale, high-availability production systems.

3+ years hands-on experience managing production Kubernetes clusters, including deep understanding of architecture, networking, storage, and security.

Experience with cluster autoscaling (Karpenter), upgrades, and multi-cluster management.

Proficiency with kubectl, Helm, Kubernetes operators, and container orchestration troubleshooting.

Advanced expertise with the Grafana observability stack: dashboards, alerting, visualization, and Grafana Alloy for telemetry collection.

Proficiency in PromQL and experience with Loki for log aggregation and analysis.

Hands-on experience managing Java-based applications in distributed environments, including JVM tuning and optimization.

Cloud platform expertise (AWS preferred; GCP or Azure also valued).

Familiarity with Infrastructure as Code tools such as Terraform/Terragrunt or Ansible.

ArgoCD proficiency for GitOps workflows and continuous deployment.

Strong scripting abilities in Python, Bash, or Go, with experience building CI/CD pipelines and deployment automation.

Proven track record with on-call rotations, incident response, and root cause analysis.

AI, Automation & Agentic Systems (Required)
1+ years of practical experience building or operating AI/LLM-powered tools, agents, or workflows in a production or production-adjacent context.

Demonstrated ability to design agentic systems that use tool calling, retrieval-augmented generation (RAG), or multi-step reasoning to accomplish operational tasks.

Experience integrating LLM APIs (e.g., Anthropic Claude, OpenAI, or open-source models) into backend services or automation pipelines.

Familiarity with at least one agentic orchestration framework or workflow engine (LangChain, LangGraph, CrewAI, n8n, Temporal, or equivalent).

Understanding of prompt engineering best practices, including structured outputs, system prompts, and few-shot examples.

Familiarity with AI-assisted coding tools (Claude Code, Codex, Cursor) and their integration into engineering workflows.

Experience building or consuming MCP (Model Context Protocol) servers to expose internal tools to AI agents.

Awareness of AI safety, hallucination mitigation, and human-in-the-loop design patterns for autonomous systems.

Preferred / Bonus
Hands-on experience with vector databases (Pinecone, Weaviate, pgvector) for RAG-based knowledge retrieval.

Experience with LLM evaluation frameworks (e.g., Galileo, LangSmith, Braintrust) for monitoring agent quality in production.

Contributions to open-source AI/ML or SRE tooling projects.

Background in data engineering or ML pipelines that complements SRE responsibilities.

Soft Skills
Strong communication skills (written and verbal) with the ability to translate complex AI and infrastructure concepts for diverse audiences.

Proactive problem-solver with a bias toward automation and continuous improvement.

Ability to mentor junior team members on both traditional SRE practices and emerging AI-driven approaches.

Positive attitude and openness to constructive feedback.

What’s in it for you?
We offer our employees more than just competitive compensation. Our team benefits include:
Competitive pay and benefits

Flexible vacation allowance

A hybrid / remote working environment

Startup culture backed by a secure, global brand

Roster of Uniques
We care deeply about every interaction our customers have with us, and trust and empower our staff to own and drive their experience. Our vision for our business and customers is built on fostering a diverse and inclusive work environment where regardless of background or beliefs you feel able to be authentic and bring all your talent into play. We want to celebrate you being you (we are an equal opportunity employer).

Vacancy posted 4 days ago
