Sr. Site Reliability Engineer (SRE)

$165k - $225k

Moonlite

Job Description

Moonlite delivers high-performance AI infrastructure for organizations running intensive computational research, large-scale model training, and demanding data processing workloads. We provide infrastructure deployed in our facilities or co-located in yours, delivering flexible on-demand or reserved compute that feels like an extension of your existing data center. Our team of AI infrastructure specialists combines bare-metal performance with cloud-native operational simplicity, enabling research teams and enterprises to deploy demanding AI workloads with enterprise-grade reliability and compliance.

Your Role:

You will be instrumental in building and operating production-grade AI infrastructure with deep Kubernetes expertise at its core. Working closely with our systems engineers, network engineers, and platform engineering team, you'll architect and operate the Kubernetes infrastructure that powers our control plane and orchestrates compute, storage, and networking at scale. This role requires a deep understanding of Kubernetes internals, custom resource definitions (CRDs), and storage and network integrations, as well as experience building production-grade clusters from the ground up (not just deploying in managed environments). You'll ensure enterprise-grade reliability while establishing automation, observability, and operational practices.

Job Responsibilities
  • Kubernetes Infrastructure Engineering: Design, build, and operate production Kubernetes clusters on bare-metal infrastructure – including cluster bootstrapping, control plane architecture, etcd management, and scaling strategies for high-performance compute workloads.
  • Kubernetes Networking & CNIs: Implement and operate custom Kubernetes networking solutions with SR-IOV for high-performance GPU interconnects, multi-tenancy isolation, and advanced network policies. Configure CNI plugins and network segmentation for research workloads.
  • Custom Operators & Controllers: Develop and maintain custom Kubernetes operators and controllers for bare-metal provisioning, infrastructure lifecycle management, and resource orchestration across compute, storage, and networking domains.
  • GPU Infrastructure Integration: Deploy and optimize NVIDIA GPU operators, device plugins, and other custom scheduling logic for GPU workload placement and utilization optimization.
  • Platform Integration & Storage: Build deep integrations between Kubernetes and underlying infrastructure including CSI drivers for storage, custom admission controllers for policy enforcement, and scheduling extensions for specialized hardware placement.
  • Infrastructure Automation: Design and implement automation using Terraform, Ansible, Helm, and custom operators to orchestrate infrastructure workflows and enable deployments across multiple regions.
  • Production Operations & Reliability: Manage production bare-metal infrastructure across multiple regions. Build systems ensuring high availability, fault tolerance, and graceful degradation – establishing SLIs, SLOs, and monitoring to meet enterprise reliability commitments.
  • Observability & Incident Response: Build comprehensive monitoring, logging, and alerting using Prometheus, Grafana, and the ELK stack. Lead incident response, conduct postmortems, and implement preventative measures to improve reliability and reduce MTTR.
  • Performance & Capacity Planning: Identify and resolve performance bottlenecks across infrastructure domains. Monitor utilization trends, forecast capacity needs, and optimize resource allocation for various workloads.
Requirements
  • Experience: 5+ years in SRE, DevOps, or infrastructure engineering roles with proven experience operating production infrastructure at scale.
  • Kubernetes Infrastructure Expertise: Deep hands-on experience building and operating production Kubernetes clusters on bare-metal infrastructure – not just deploying workloads in managed clusters. Must understand cluster bootstrapping, control plane architecture, etcd operations, and scaling strategies.
  • Kubernetes Internals & Integration: Strong understanding of Kubernetes internals including custom resource definitions (CRDs), operators, controllers, admission webhooks, and scheduling. Experience integrating storage (CSI drivers), networking (CNI, SR-IOV), and specialized hardware (GPU device plugins) with Kubernetes.
  • Linux Systems Experience: Strong fundamentals in Linux systems administration, performance tuning, troubleshooting, and automation in production environments.
  • Infrastructure Automation: Proficiency with infrastructure-as-code tools (Terraform, Ansible, Helm) and building automation to reduce operational overhead.
  • Networking Fundamentals: Solid understanding of networking concepts including IPAM, DNS, DHCP, VLAN/VXLAN, routing, load balancing, and experience troubleshooting network issues in production.
  • Observability & Monitoring: Experience building and maintaining comprehensive monitoring solutions using tools like Prometheus, Grafana, and centralized logging systems.
  • Reliability Practices: Understanding of SRE principles including SLIs/SLOs/SLAs, error budgets, incident management, and blameless postmortems.
  • Scripting & Automation: Strong scripting skills in Go, Python, or Bash for automation, tooling development, and operational efficiency.
  • Problem-Solving Under Pressure: Demonstrated ability to troubleshoot complex issues under pressure, manage incidents effectively, and communicate clearly during outages.
  • Collaboration & Communication: Excellent communication skills and ability to work across teams including systems engineers, network engineers, and software developers.
Preferred Qualifications
  • Experience building custom Kubernetes operators or controllers for infrastructure orchestration
  • Deep familiarity with Kubernetes networking (Calico, Cilium, Multus), service mesh technologies, and network policy management
  • Experience with GPU workload orchestration including NVIDIA GPU Operator, MIG, time-slicing, and device plugins
  • Background with advanced Kubernetes features including custom schedulers, admission controllers, and API server extensions
  • Experience with Kubernetes cluster federation or multi-cluster management
  • Knowledge of high-performance networking technologies (InfiniBand, RDMA, RoCE) and their integration with Kubernetes
  • Experience with enterprise storage systems (VAST, Lightbits, Ceph, or similar)
  • Familiarity with configuration management at scale and GitOps practices
  • Understanding of security best practices for Kubernetes and bare-metal infrastructure
  • Experience operating infrastructure in regulated industries or co-located data center environments
  • Background supporting research institutions, technical computing environments, or enterprise AI infrastructure
Key Technologies
  • Kubernetes, Linux, Terraform, Ansible, Prometheus, Grafana, ELK Stack, Go, Python, Bash, NVIDIA GPU Technologies, High-Performance Networking, Enterprise Storage Systems
Why Moonlite
  • Build Critical Research Infrastructure: Your work will directly enable quantitative research teams and AI practitioners to push the boundaries of what's possible in financial modeling and AI research.
  • Enterprise Impact: Build and operate infrastructure that supports mission-critical research and AI workloads for leading financial institutions and research organizations.
  • Technical Excellence: Join an infrastructure team focused on delivering enterprise-grade reliability while pushing the boundaries of high-performance computing capabilities.
  • Hands-On Ownership: As part of our growing infrastructure team, you'll have significant ownership over critical systems and the autonomy to influence our operational practices and technology choices.
  • Industry Leadership: Work alongside experienced infrastructure professionals who have built and operated systems for the most demanding computing environments.

We offer a competitive total compensation package combining base salary, startup equity, and industry-leading benefits. The total compensation range for this role is $165,000 – $225,000, which includes both base salary and equity. Actual compensation will be determined based on experience, skills, and market alignment. We provide generous benefits, including a 6% 401(k) match, fully covered health insurance premiums, and other comprehensive offerings to support your well-being and success as we grow together.

#li-remote

Vacancy posted 24 days ago