Sr. Site Reliability Engineer (SRE)

$165k - $225k

Moonlite AI

Chicago, IL or Remote

Moonlite delivers high-performance AI infrastructure for organizations running intensive computational research, large-scale model training, and demanding data processing workloads. We provide infrastructure deployed in our facilities or co-located in yours, delivering flexible on-demand or reserved compute that feels like an extension of your existing data center. Our team of AI infrastructure specialists combines bare-metal performance with cloud-native operational simplicity, enabling research teams and enterprises to deploy demanding AI workloads with enterprise-grade reliability and compliance.

Your Role:

You will be instrumental in building and operating production-grade AI infrastructure with deep Kubernetes expertise at its core. Working closely with our systems engineers, network engineers, and platform engineering team, you'll architect and operate the Kubernetes infrastructure that powers our control plane and orchestrates compute, storage, and networking at scale. This role requires deep understanding of Kubernetes internals, custom resource definitions (CRDs), storage and network integrations, and building production-grade clusters from the ground up (not just deploying in managed environments). You'll ensure enterprise-grade reliability while establishing the automation, observability, and operational practices that sustain it.

Job Responsibilities
  • Kubernetes Infrastructure Engineering: Design, build, and operate production Kubernetes clusters on bare-metal infrastructure – including cluster bootstrapping, control plane architecture, etcd management, and scaling strategies for high-performance compute workloads.
  • Kubernetes Networking & CNIs: Implement and operate custom Kubernetes networking solutions with SR-IOV for high-performance GPU interconnects, multi-tenancy isolation, and advanced networking policies. Configure CNI plugins and network segmentation for research workloads.
  • Custom Operators & Controllers: Develop and maintain custom Kubernetes operators and controllers for bare-metal provisioning, infrastructure lifecycle management, and resource orchestration across compute, storage, and networking domains.
  • GPU Infrastructure Integration: Deploy and optimize NVIDIA GPU operators, device plugins, and other custom scheduling logic for GPU workload placement and utilization optimization.
  • Platform Integration & Storage: Build deep integrations between Kubernetes and underlying infrastructure including CSI drivers for storage, custom admission controllers for policy enforcement, and scheduling extensions for specialized hardware placement.
  • Infrastructure Automation: Design and implement automation using Terraform, Ansible, Helm, and custom operators to orchestrate infrastructure workflows and enable deployments across multiple regions.
  • Production Operations & Reliability: Manage production bare-metal infrastructure across multiple regions. Build systems ensuring high availability, fault tolerance, and graceful degradation – establishing SLIs, SLOs, and monitoring to meet enterprise reliability commitments.
  • Observability & Incident Response: Build comprehensive monitoring, logging, and alerting using Prometheus, Grafana, and ELK stack. Lead incident response, conduct postmortems, and implement preventative measures to improve reliability and reduce MTTR.
  • Performance & Capacity Planning: Identify and resolve performance bottlenecks across infrastructure domains. Monitor utilization trends, forecast capacity needs, and optimize resource allocation for various workloads.
Requirements
  • Experience: 5+ years in SRE, DevOps, or infrastructure engineering roles with proven experience operating production infrastructure at scale.
  • Kubernetes Infrastructure Expertise: Deep hands-on experience building and operating production Kubernetes clusters on bare-metal infrastructure – not just deploying workloads in managed clusters. Must understand cluster bootstrapping, control plane architecture, etcd operations, and scaling strategies.
  • Kubernetes Internals & Integration: Strong understanding of Kubernetes internals including custom resource definitions (CRDs), operators, controllers, admission webhooks, and scheduling. Experience integrating storage (CSI drivers), networking (CNI, SR-IOV), and specialized hardware (GPU device plugins) with Kubernetes.
  • Linux Systems Experience: Strong fundamentals in Linux systems administration, performance tuning, troubleshooting, and automation in production environments.
  • Infrastructure Automation: Proficiency with infrastructure-as-code tools (Terraform, Ansible, Helm) and building automation to reduce operational overhead.
  • Networking Fundamentals: Solid understanding of networking concepts including IPAM, DNS, DHCP, VLAN/VXLAN, routing, load balancing, and experience troubleshooting network issues in production.
  • Observability & Monitoring: Experience building and maintaining comprehensive monitoring solutions using tools like Prometheus, Grafana, and centralized logging systems.
  • Reliability Practices: Understanding of SRE principles including SLIs/SLOs/SLAs, error budgets, incident management, and blameless postmortems.
  • Scripting & Automation: Strong scripting skills in Go, Python, or Bash for automation, tooling development, and operational efficiency.
  • Problem-Solving Under Pressure: Demonstrated ability to troubleshoot complex issues under pressure, manage incidents effectively, and communicate clearly during outages.
  • Collaboration & Communication: Excellent communication skills and ability to work across teams including systems engineers, network engineers, and software developers.
Preferred Qualifications
  • Experience building custom Kubernetes operators or controllers for infrastructure orchestration
  • Deep familiarity with Kubernetes networking (Calico, Cilium, Multus), service mesh technologies, and network policy management
  • Experience with GPU workload orchestration including NVIDIA GPU Operator, MIG, time-slicing, and device plugins
  • Background with advanced Kubernetes features including custom schedulers, admission controllers, and API server extensions
  • Experience with Kubernetes cluster federation or multi-cluster management
  • Knowledge of high-performance networking technologies (InfiniBand, RDMA, RoCE) and their integration with Kubernetes
  • Experience with enterprise storage systems (VAST, Lightbits, Ceph, or similar)
  • Familiarity with configuration management at scale and GitOps practices
  • Understanding of security best practices for Kubernetes and bare-metal infrastructure
  • Experience operating infrastructure in regulated industries or co-located data center environments
  • Background supporting research institutions, technical computing environments, or enterprise AI infrastructure
Key Technologies
  • Kubernetes, Linux, Terraform, Ansible, Prometheus, Grafana, ELK Stack, Go, Python, Bash, NVIDIA GPU Technologies, High-Performance Networking, Enterprise Storage Systems
Why Moonlite
  • Build Critical Research Infrastructure: Your work will directly enable quantitative research teams and AI practitioners to push the boundaries of what's possible in financial modeling and AI research.
  • Enterprise Impact: Build and operate infrastructure that supports mission-critical research and AI workloads for leading financial institutions and research organizations.
  • Technical Excellence: Join an infrastructure team focused on delivering enterprise-grade reliability while pushing the boundaries of high-performance computing capabilities.
  • Hands-On Ownership: As part of our growing infrastructure team, you'll have significant ownership over critical systems and the autonomy to influence our operational practices and technology choices.
  • Industry Leadership: Work alongside experienced infrastructure professionals who have built and operated systems for the most demanding computing environments.

We offer a competitive total compensation package combining base salary, startup equity, and industry-leading benefits. The total compensation range for this role is $165,000 – $225,000, which includes both base salary and equity. Actual compensation will be determined based on experience, skills, and market alignment. We provide generous benefits, including a 6% 401(k) match, fully covered health insurance premiums, and other comprehensive offerings to support your well-being and success as we grow together.

Vacancy posted 1 day ago