Remote Senior Site Reliability Engineer

GrabJobs

Job description

What are we building?
Hard Rock Digital is a team focused on becoming the best online sportsbook, casino, and social gaming company in the world. We’re building a team that shares a passion for learning, operating, and building new products and technologies for millions of consumers. We care about each customer interaction, experience, behavior, and insight and strive to ensure we’re always acting authentically.

Rooted in the kindred spirits of Hard Rock and the Seminole Tribe of Florida, Hard Rock Digital taps a brand known the world over as the leader in gaming, entertainment, and hospitality. We’re taking that foundation of success and bringing it to the digital space - ready to join us?

What’s the position?
We are looking for a Senior Site Reliability Engineer who combines deep infrastructure expertise with a forward-thinking approach to AI-driven operations. In this role you will maintain and improve the reliability, scalability, and performance of our Java-based applications while pioneering the use of large language models (LLMs), agentic workflows, and intelligent automation to transform how we monitor, respond to, and prevent incidents.

You will design and build autonomous and semi-autonomous AI agents that consume observability data, triage alerts, generate runbooks, automate incident response steps, and surface actionable insights—reducing toil and accelerating mean time to resolution. This is a hands-on engineering role for someone who is equally comfortable tuning a JVM, writing PromQL, and prototyping an agentic pipeline with tool-calling LLMs.

Key Responsibilities
Application Reliability & Performance
Ensure the availability, reliability, and performance of high-traffic Java-based applications in a distributed environment.

Troubleshoot and resolve complex issues across production and non-production environments.

Participate in pre- and post-deployment performance testing and monitoring to continuously improve application performance.

Optimize Java application performance with a focus on JVM tuning, efficient resource utilization, and horizontal scaling.

Monitoring, Observability & AIOps
Deploy and manage the Grafana stack (Grafana, Prometheus, Loki, Mimir, Alloy) to deliver real-time monitoring, logging, and alerting.

Implement and refine observability strategies that enhance visibility into application and infrastructure health.

Create and maintain dashboards, alerts, and log queries for comprehensive system health monitoring.

Integrate AI/ML models into the observability pipeline for anomaly detection, predictive alerting, and intelligent alert correlation and noise reduction.

AI & Agentic Workflow Engineering
Design, build, and operate agentic AI workflows that automate operational tasks such as alert triage, root cause analysis, runbook execution, and incident summarization.

Develop tool-calling LLM agents that interact with infrastructure APIs (Kubernetes, Grafana, Jira, Slack, PagerDuty) to execute diagnostic and remediation actions autonomously or with human-in-the-loop approval.

Build and maintain MCP (Model Context Protocol) servers and integrations that expose internal systems as tool surfaces for AI agents.

Evaluate, select, and operationalize LLM frameworks and orchestration platforms (e.g., LangChain, LangGraph, CrewAI, n8n, or custom solutions) for production-grade agentic systems.

Implement guardrails, evaluation harnesses, and feedback loops to ensure AI agent outputs are accurate, safe, and continuously improving.

Champion the adoption of AI-assisted development and operations practices across the SRE and broader engineering organization.

Incident Management & Root Cause Analysis
Support the operations team’s incident response efforts, conduct post-mortems, and identify root causes to prevent recurrence.

Leverage AI tools to accelerate incident timelines, auto-generate post-mortem drafts, and surface patterns across historical incidents.

Document and share lessons learned, contributing to a culture of continuous improvement.

Automation & Toil Reduction
Identify repetitive operational workflows and engineer AI-augmented or fully automated replacements.

Build self-service tools and chatbot interfaces that allow engineering teams to query system status, retrieve logs, and execute standard operating procedures through natural language.

Measure and report on toil reduction metrics to quantify the impact of automation initiatives.

Collaboration & Cross-functional Support
Work closely with developers, architects, and data/ML engineers to design solutions that improve reliability and leverage AI capabilities.

Collaborate with DevOps and NOC teams to support the application platform.

Communicate SRE practices, AI/automation capabilities, and operational insights to technical and non-technical stakeholders.

Provide feedback on application performance, potential improvements, and observability metrics.

Why This Role Is Different
This is not a traditional SRE position with AI bolted on as an afterthought. We are building a team that treats AI and agentic automation as core competencies—on par with Kubernetes expertise or observability design. You will have the autonomy to experiment with cutting-edge AI tools, the backing of leadership to deploy them in production, and a mandate to measurably reduce operational toil through intelligent systems.

Job requirements

What are we looking for?
Core SRE & Infrastructure (Required)
Degree in Computer Science or a related field, or equivalent professional experience.

5+ years in SRE, DevOps, or similar infrastructure roles with experience managing large-scale, high-availability production systems.

3+ years hands-on experience managing production Kubernetes clusters, including deep understanding of architecture, networking, storage, and security.

Experience with cluster autoscaling (Karpenter), upgrades, and multi-cluster management.

Proficiency with kubectl, Helm, Kubernetes operators, and container orchestration troubleshooting.

Advanced expertise with the Grafana observability stack: dashboards, alerting, visualization, and Grafana Alloy for telemetry collection.

Proficiency in PromQL and experience with Loki for log aggregation and analysis.

Hands-on experience managing Java-based applications in distributed environments, including JVM tuning and optimization.

Cloud platform expertise (AWS preferred; GCP or Azure also valued).

Familiarity with Infrastructure as Code tools such as Terraform/Terragrunt or Ansible.

ArgoCD proficiency for GitOps workflows and continuous deployment.

Strong scripting abilities in Python, Bash, or Go, with experience building CI/CD pipelines and deployment automation.

Proven track record with on-call rotations, incident response, and root cause analysis.

AI, Automation & Agentic Systems (Required)
1+ years of practical experience building or operating AI/LLM-powered tools, agents, or workflows in a production or production-adjacent context.

Demonstrated ability to design agentic systems that use tool calling, retrieval-augmented generation (RAG), or multi-step reasoning to accomplish operational tasks.

Experience integrating LLM APIs (e.g., Anthropic Claude, OpenAI, or open-source models) into backend services or automation pipelines.

Familiarity with at least one agentic orchestration framework or workflow engine (LangChain, LangGraph, CrewAI, n8n, Temporal, or equivalent).

Understanding of prompt engineering best practices, including structured outputs, system prompts, and few-shot examples.

Familiarity with AI-assisted coding tools (Claude Code, Codex, Cursor) and their integration into engineering workflows.

Experience building or consuming MCP (Model Context Protocol) servers to expose internal tools to AI agents.

Awareness of AI safety, hallucination mitigation, and human-in-the-loop design patterns for autonomous systems.

Preferred / Bonus
Hands-on experience with vector databases (Pinecone, Weaviate, pgvector) for RAG-based knowledge retrieval.

Experience with LLM evaluation frameworks (e.g., Galileo, LangSmith, Braintrust) for monitoring agent quality in production.

Contributions to open-source AI/ML or SRE tooling projects.

Background in data engineering or ML pipelines that complements SRE responsibilities.

Soft Skills
Strong communication skills (written and verbal) with the ability to translate complex AI and infrastructure concepts for diverse audiences.

Proactive problem-solver with a bias toward automation and continuous improvement.

Ability to mentor junior team members on both traditional SRE practices and emerging AI-driven approaches.

Positive attitude and openness to constructive feedback.

What’s in it for you?
We offer our employees more than just competitive compensation. Our team benefits include:
Competitive pay and benefits

Flexible vacation allowance

A hybrid / remote working environment

Startup culture backed by a secure, global brand

Roster of Uniques
We care deeply about every interaction our customers have with us, and trust and empower our staff to own and drive their experience. Our vision for our business and customers is built on fostering a diverse and inclusive work environment where regardless of background or beliefs you feel able to be authentic and bring all your talent into play. We want to celebrate you being you (we are an equal opportunity employer).

Vacancy posted 1 day ago