Sr. Site Reliability Engineer (SRE)

$165k - $225k

Full-time

Moonlite

Job Description

Moonlite delivers high-performance AI infrastructure for organizations running intensive computational research, large-scale model training, and demanding data processing workloads. We provide infrastructure deployed in our facilities or co-located in yours, delivering flexible on-demand or reserved compute that feels like an extension of your existing data center. Our team of AI infrastructure specialists combines bare-metal performance with cloud-native operational simplicity, enabling research teams and enterprises to deploy demanding AI workloads with enterprise-grade reliability and compliance.

Your Role:

You will be instrumental in building and operating production-grade AI infrastructure with deep Kubernetes expertise at its core. Working closely with our systems engineers, network engineers, and platform engineering team, you'll architect and operate the Kubernetes infrastructure that powers our control plane and orchestrates compute, storage, and networking at scale. This role requires a deep understanding of Kubernetes internals, custom resource definitions (CRDs), and storage and network integrations, as well as experience building production-grade clusters from the ground up (not just deploying in managed environments). You'll ensure enterprise-grade reliability while establishing our automation, observability, and operational practices.

Job Responsibilities

Kubernetes Infrastructure Engineering: Design, build, and operate production Kubernetes clusters on bare-metal infrastructure – including cluster bootstrapping, control plane architecture, etcd management, and scaling strategies for high-performance compute workloads.
Kubernetes Networking & CNIs: Implement and operate custom Kubernetes networking solutions with SR-IOV for high-performance GPU interconnects, multi-tenant isolation, and advanced network policies. Configure CNI plugins and network segmentation for research workloads.
Custom Operators & Controllers: Develop and maintain custom Kubernetes operators and controllers for bare-metal provisioning, infrastructure lifecycle management, and resource orchestration across compute, storage, and networking domains.
GPU Infrastructure Integration: Deploy and optimize NVIDIA GPU operators, device plugins, and other custom scheduling logic for GPU workload placement and utilization optimization.
Platform Integration & Storage: Build deep integrations between Kubernetes and underlying infrastructure including CSI drivers for storage, custom admission controllers for policy enforcement, and scheduling extensions for specialized hardware placement.
Infrastructure Automation: Design and implement automation using Terraform, Ansible, Helm, and custom operators to orchestrate infrastructure workflows and enable deployments across multiple regions.
Production Operations & Reliability: Manage production bare-metal infrastructure across multiple regions. Build systems ensuring high availability, fault tolerance, and graceful degradation – establishing SLIs, SLOs, and monitoring to meet enterprise reliability commitments.
Observability & Incident Response: Build comprehensive monitoring, logging, and alerting using Prometheus, Grafana, and the ELK Stack. Lead incident response, conduct postmortems, and implement preventative measures to improve reliability and reduce MTTR.
Performance & Capacity Planning: Identify and resolve performance bottlenecks across infrastructure domains. Monitor utilization trends, forecast capacity needs, and optimize resource allocation for various workloads.

Requirements

Experience: 5+ years in SRE, DevOps, or infrastructure engineering roles with proven experience operating production infrastructure at scale.
Kubernetes Infrastructure Expertise: Deep hands-on experience building and operating production Kubernetes clusters on bare-metal infrastructure – not just deploying workloads in managed clusters. Must understand cluster bootstrapping, control plane architecture, etcd operations, and scaling strategies.
Kubernetes Internals & Integration: Strong understanding of Kubernetes internals including custom resource definitions (CRDs), operators, controllers, admission webhooks, and scheduling. Experience integrating storage (CSI drivers), networking (CNI, SR-IOV), and specialized hardware (GPU device plugins) with Kubernetes.
Linux Systems Experience: Strong fundamentals in Linux systems administration, performance tuning, troubleshooting, and automation in production environments.
Infrastructure Automation: Proficiency with infrastructure-as-code tools (Terraform, Ansible, Helm) and building automation to reduce operational overhead.
Networking Fundamentals: Solid understanding of networking concepts including IPAM, DNS, DHCP, VLAN/VXLAN, routing, load balancing, and experience troubleshooting network issues in production.
Observability & Monitoring: Experience building and maintaining comprehensive monitoring solutions using tools like Prometheus, Grafana, and centralized logging systems.
Reliability Practices: Understanding of SRE principles including SLIs/SLOs/SLAs, error budgets, incident management, and blameless postmortems.
Scripting & Automation: Strong scripting skills in Go, Python, or Bash for automation, tooling development, and operational efficiency.
Problem-Solving Under Pressure: Demonstrated ability to troubleshoot complex issues under pressure, manage incidents effectively, and communicate clearly during outages.
Collaboration & Communication: Excellent communication skills and ability to work across teams including systems engineers, network engineers, and software developers.

Preferred Qualifications

Experience building custom Kubernetes operators or controllers for infrastructure orchestration
Deep familiarity with Kubernetes networking (Calico, Cilium, Multus), service mesh technologies, and network policy management
Experience with GPU workload orchestration including NVIDIA GPU Operator, MIG, time-slicing, and device plugins
Background with advanced Kubernetes features including custom schedulers, admission controllers, and API server extensions
Experience with Kubernetes cluster federation or multi-cluster management
Knowledge of high-performance networking technologies (InfiniBand, RDMA, RoCE) and their integration with Kubernetes
Experience with enterprise storage systems (VAST, Lightbits, Ceph, or similar)
Familiarity with configuration management at scale and GitOps practices
Understanding of security best practices for Kubernetes and bare-metal infrastructure
Experience operating infrastructure in regulated industries or co-located data center environments
Background supporting research institutions, technical computing environments, or enterprise AI infrastructure

Key Technologies

Kubernetes, Linux, Terraform, Ansible, Prometheus, Grafana, ELK Stack, Go, Python, Bash, NVIDIA GPU Technologies, High-Performance Networking, Enterprise Storage Systems

Why Moonlite

Build Critical Research Infrastructure: Your work will directly enable quantitative research teams and AI practitioners to push the boundaries of what's possible in financial modeling and AI research.
Enterprise Impact: Build and operate infrastructure that supports mission-critical research and AI workloads for leading financial institutions and research organizations.
Technical Excellence: Join an infrastructure team focused on delivering enterprise-grade reliability while pushing the boundaries of high-performance computing capabilities.
Hands-On Ownership: As part of our growing infrastructure team, you'll have significant ownership over critical systems and the autonomy to influence our operational practices and technology choices.
Industry Leadership: Work alongside experienced infrastructure professionals who have built and operated systems for the most demanding computing environments.

We offer a competitive total compensation package combining base salary, startup equity, and industry-leading benefits. The total compensation range for this role is $165,000 – $225,000, which includes both base salary and equity. Actual compensation will be determined based on experience, skills, and market alignment. Benefits include a 6% 401(k) match, fully covered health insurance premiums, and other comprehensive offerings to support your well-being and success as we grow together.

#li-remote

Vacancy posted a month ago