Sr. Site Reliability Engineer (SRE)

$165k - $225k

Moonlite AI

Chicago, IL or Remote

Moonlite delivers high-performance AI infrastructure for organizations running intensive computational research, large-scale model training, and demanding data processing workloads. We provide infrastructure deployed in our facilities or co-located in yours, delivering flexible on-demand or reserved compute that feels like an extension of your existing data center. Our team of AI infrastructure specialists combines bare-metal performance with cloud-native operational simplicity, enabling research teams and enterprises to deploy demanding AI workloads with enterprise-grade reliability and compliance.

Your Role:

You will be instrumental in building and operating production-grade AI infrastructure with deep Kubernetes expertise at its core. Working closely with our systems engineers, network engineers, and platform engineering team, you'll architect and operate the Kubernetes infrastructure that powers our control plane and orchestrates compute, storage, and networking at scale. This role requires a deep understanding of Kubernetes internals, custom resource definitions (CRDs), and storage and network integrations, as well as experience building production-grade clusters from the ground up (not just deploying in managed environments). You'll ensure enterprise-grade reliability while establishing the automation, observability, and operational practices that support it.

Job Responsibilities

Kubernetes Infrastructure Engineering: Design, build, and operate production Kubernetes clusters on bare-metal infrastructure – including cluster bootstrapping, control plane architecture, etcd management, and scaling strategies for high-performance compute workloads.
Kubernetes Networking & CNIs: Implement and operate custom Kubernetes networking solutions with SR-IOV for high-performance GPU interconnects, multi-tenant isolation, and advanced network policies. Configure CNI plugins and network segmentation for research workloads.
Custom Operators & Controllers: Develop and maintain custom Kubernetes operators and controllers for bare-metal provisioning, infrastructure lifecycle management, and resource orchestration across compute, storage, and networking domains.
GPU Infrastructure Integration: Deploy and optimize NVIDIA GPU operators and device plugins, and build custom scheduling logic for GPU workload placement and utilization optimization.
Platform Integration & Storage: Build deep integrations between Kubernetes and underlying infrastructure including CSI drivers for storage, custom admission controllers for policy enforcement, and scheduling extensions for specialized hardware placement.
Infrastructure Automation: Design and implement automation using Terraform, Ansible, Helm, and custom operators to orchestrate infrastructure workflows and enable deployments across multiple regions.
Production Operations & Reliability: Manage production bare-metal infrastructure across multiple regions. Build systems ensuring high availability, fault tolerance, and graceful degradation – establishing SLIs, SLOs, and monitoring to meet enterprise reliability commitments.
Observability & Incident Response: Build comprehensive monitoring, logging, and alerting using Prometheus, Grafana, and the ELK Stack. Lead incident response, conduct postmortems, and implement preventive measures to improve reliability and reduce MTTR.
Performance & Capacity Planning: Identify and resolve performance bottlenecks across infrastructure domains. Monitor utilization trends, forecast capacity needs, and optimize resource allocation for various workloads.

Requirements

Experience: 5+ years in SRE, DevOps, or infrastructure engineering roles with proven experience operating production infrastructure at scale.
Kubernetes Infrastructure Expertise: Deep hands-on experience building and operating production Kubernetes clusters on bare-metal infrastructure – not just deploying workloads in managed clusters. Must understand cluster bootstrapping, control plane architecture, etcd operations, and scaling strategies.
Kubernetes Internals & Integration: Strong understanding of Kubernetes internals including custom resource definitions (CRDs), operators, controllers, admission webhooks, and scheduling. Experience integrating storage (CSI drivers), networking (CNI, SR-IOV), and specialized hardware (GPU device plugins) with Kubernetes.
Linux Systems Experience: Strong fundamentals in Linux systems administration, performance tuning, troubleshooting, and automation in production environments.
Infrastructure Automation: Proficiency with infrastructure-as-code tools (Terraform, Ansible, Helm) and building automation to reduce operational overhead.
Networking Fundamentals: Solid understanding of networking concepts including IPAM, DNS, DHCP, VLAN/VXLAN, routing, load balancing, and experience troubleshooting network issues in production.
Observability & Monitoring: Experience building and maintaining comprehensive monitoring solutions using tools like Prometheus, Grafana, and centralized logging systems.
Reliability Practices: Understanding of SRE principles including SLIs/SLOs/SLAs, error budgets, incident management, and blameless postmortems.
Scripting & Automation: Strong scripting skills in Go, Python, or Bash for automation, tooling development, and operational efficiency.
Problem-Solving Under Pressure: Demonstrated ability to troubleshoot complex issues under pressure, manage incidents effectively, and communicate clearly during outages.
Collaboration & Communication: Excellent communication skills and ability to work across teams including systems engineers, network engineers, and software developers.

Preferred Qualifications

Experience building custom Kubernetes operators or controllers for infrastructure orchestration
Deep familiarity with Kubernetes networking (Calico, Cilium, Multus), service mesh technologies, and network policy management
Experience with GPU workload orchestration including NVIDIA GPU Operator, MIG, time-slicing, and device plugins
Background with advanced Kubernetes features including custom schedulers, admission controllers, and API server extensions
Experience with Kubernetes cluster federation or multi-cluster management
Knowledge of high-performance networking technologies (InfiniBand, RDMA, RoCE) and their integration with Kubernetes
Experience with enterprise storage systems (VAST, Lightbits, Ceph, or similar)
Familiarity with configuration management at scale and GitOps practices
Understanding of security best practices for Kubernetes and bare-metal infrastructure
Experience operating infrastructure in regulated industries or co-located data center environments
Background supporting research institutions, technical computing environments, or enterprise AI infrastructure

Key Technologies

Kubernetes, Linux, Terraform, Ansible, Prometheus, Grafana, ELK Stack, Go, Python, Bash, NVIDIA GPU Technologies, High-Performance Networking, Enterprise Storage Systems

Why Moonlite

Build Critical Research Infrastructure: Your work will directly enable quantitative research teams and AI practitioners to push the boundaries of what's possible in financial modeling and AI research.
Enterprise Impact: Build and operate infrastructure that supports mission-critical research and AI workloads for leading financial institutions and research organizations.
Technical Excellence: Join an infrastructure team focused on delivering enterprise-grade reliability while pushing the boundaries of high-performance computing capabilities.
Hands-On Ownership: As part of our growing infrastructure team, you'll have significant ownership over critical systems and the autonomy to influence our operational practices and technology choices.
Industry Leadership: Work alongside experienced infrastructure professionals who have built and operated systems for the most demanding computing environments.

We offer a competitive total compensation package combining base salary, startup equity, and industry-leading benefits. The total compensation range for this role is $165,000 – $225,000, which includes both base salary and equity. Actual compensation will be determined based on experience, skills, and market alignment. We provide generous benefits, including a 6% 401(k) match, fully covered health insurance premiums, and other comprehensive offerings to support your well-being and success as we grow together.

Vacancy posted 2 days ago
