Sr. Site Reliability Engineer (SRE)

$165k - $225k

Moonlite

Job Description

Moonlite delivers high-performance AI infrastructure for organizations running intensive computational research, large-scale model training, and demanding data processing workloads. We provide infrastructure deployed in our facilities or co-located in yours, delivering flexible on-demand or reserved compute that feels like an extension of your existing data center. Our team of AI infrastructure specialists combines bare-metal performance with cloud-native operational simplicity, enabling research teams and enterprises to deploy demanding AI workloads with enterprise-grade reliability and compliance.

Your Role:

You will be instrumental in building and operating production-grade AI infrastructure with deep Kubernetes expertise at its core. Working closely with our systems engineers, network engineers, and platform engineering team, you'll architect and operate the Kubernetes infrastructure that powers our control plane and orchestrates compute, storage, and networking at scale. This role requires a deep understanding of Kubernetes internals, custom resource definitions (CRDs), and storage and network integrations, as well as experience building production-grade clusters from the ground up (not just deploying in managed environments). You'll ensure enterprise-grade reliability while establishing the automation, observability, and operational practices that support it.

Job Responsibilities

Kubernetes Infrastructure Engineering: Design, build, and operate production Kubernetes clusters on bare-metal infrastructure – including cluster bootstrapping, control plane architecture, etcd management, and scaling strategies for high-performance compute workloads.
Kubernetes Networking & CNIs: Implement and operate custom Kubernetes networking solutions with SR-IOV for high-performance GPU interconnects, multi-tenancy isolation, and advanced network policies. Configure CNI plugins and network segmentation for research workloads.
Custom Operators & Controllers: Develop and maintain custom Kubernetes operators and controllers for bare-metal provisioning, infrastructure lifecycle management, and resource orchestration across compute, storage, and networking domains.
GPU Infrastructure Integration: Deploy and optimize the NVIDIA GPU Operator, device plugins, and custom scheduling logic for GPU workload placement and utilization optimization.
Platform Integration & Storage: Build deep integrations between Kubernetes and underlying infrastructure including CSI drivers for storage, custom admission controllers for policy enforcement, and scheduling extensions for specialized hardware placement.
Infrastructure Automation: Design and implement automation using Terraform, Ansible, Helm, and custom operators to orchestrate infrastructure workflows and enable deployments across multiple regions.
Production Operations & Reliability: Manage production bare-metal infrastructure across multiple regions. Build systems ensuring high availability, fault tolerance, and graceful degradation – establishing SLIs, SLOs, and monitoring to meet enterprise reliability commitments.
Observability & Incident Response: Build comprehensive monitoring, logging, and alerting using Prometheus, Grafana, and the ELK Stack. Lead incident response, conduct postmortems, and implement preventative measures to improve reliability and reduce MTTR.
Performance & Capacity Planning: Identify and resolve performance bottlenecks across infrastructure domains. Monitor utilization trends, forecast capacity needs, and optimize resource allocation for various workloads.

Requirements

Experience: 5+ years in SRE, DevOps, or infrastructure engineering roles with proven experience operating production infrastructure at scale.
Kubernetes Infrastructure Expertise: Deep hands-on experience building and operating production Kubernetes clusters on bare-metal infrastructure – not just deploying workloads in managed clusters. Must understand cluster bootstrapping, control plane architecture, etcd operations, and scaling strategies.
Kubernetes Internals & Integration: Strong understanding of Kubernetes internals including custom resource definitions (CRDs), operators, controllers, admission webhooks, and scheduling. Experience integrating storage (CSI drivers), networking (CNI, SR-IOV), and specialized hardware (GPU device plugins) with Kubernetes.
Linux Systems Experience: Strong fundamentals in Linux systems administration, performance tuning, troubleshooting, and automation in production environments.
Infrastructure Automation: Proficiency with infrastructure-as-code tools (Terraform, Ansible, Helm) and building automation to reduce operational overhead.
Networking Fundamentals: Solid understanding of networking concepts including IPAM, DNS, DHCP, VLAN/VXLAN, routing, and load balancing, plus experience troubleshooting network issues in production.
Observability & Monitoring: Experience building and maintaining comprehensive monitoring solutions using tools like Prometheus, Grafana, and centralized logging systems.
Reliability Practices: Understanding of SRE principles including SLIs/SLOs/SLAs, error budgets, incident management, and blameless postmortems.
Scripting & Automation: Strong scripting skills in Go, Python, or Bash for automation, tooling development, and operational efficiency.
Problem-Solving Under Pressure: Demonstrated ability to troubleshoot complex issues under pressure, manage incidents effectively, and communicate clearly during outages.
Collaboration & Communication: Excellent communication skills and ability to work across teams including systems engineers, network engineers, and software developers.

Preferred Qualifications

Experience building custom Kubernetes operators or controllers for infrastructure orchestration
Deep familiarity with Kubernetes networking (Calico, Cilium, Multus), service mesh technologies, and network policy management
Experience with GPU workload orchestration including NVIDIA GPU Operator, MIG, time-slicing, and device plugins
Background with advanced Kubernetes features including custom schedulers, admission controllers, and API server extensions
Experience with Kubernetes cluster federation or multi-cluster management
Knowledge of high-performance networking technologies (InfiniBand, RDMA, RoCE) and their integration with Kubernetes
Experience with enterprise storage systems (VAST, Lightbits, Ceph, or similar)
Familiarity with configuration management at scale and GitOps practices
Understanding of security best practices for Kubernetes and bare-metal infrastructure
Experience operating infrastructure in regulated industries or co-located data center environments
Background supporting research institutions, technical computing environments, or enterprise AI infrastructure

Key Technologies

Kubernetes, Linux, Terraform, Ansible, Prometheus, Grafana, ELK Stack, Go, Python, Bash, NVIDIA GPU Technologies, High-Performance Networking, Enterprise Storage Systems

Why Moonlite

Build Critical Research Infrastructure: Your work will directly enable quantitative research teams and AI practitioners to push the boundaries of what's possible in financial modeling and AI research.
Enterprise Impact: Build and operate infrastructure that supports mission-critical research and AI workloads for leading financial institutions and research organizations.
Technical Excellence: Join an infrastructure team focused on delivering enterprise-grade reliability while pushing the boundaries of high-performance computing capabilities.
Hands-On Ownership: As part of our growing infrastructure team, you'll have significant ownership over critical systems and the autonomy to influence our operational practices and technology choices.
Industry Leadership: Work alongside experienced infrastructure professionals who have built and operated systems for the most demanding computing environments.

We offer a competitive total compensation package that combines base salary, startup equity, and industry-leading benefits. The total compensation range for this role is $165,000 – $225,000, which includes both base salary and equity. Actual compensation will be determined based on experience, skills, and market alignment. We provide generous benefits, including a 6% 401(k) match, fully covered health insurance premiums, and other comprehensive offerings to support your well-being and success as we grow together.

Vacancy posted 23 days ago
