Principal Site Reliability Engineer (SRE)
INFINITE CHOICE LLC
Job Description
About the Role
We're seeking an exceptional Principal Site Reliability Engineer to architect, design, and build our SRE foundation from the ground up at InfiniteChoice. This is a rare greenfield opportunity to establish SRE practices, develop custom tooling, and create the reliability culture that will support our platform serving millions of users and billions in transaction volume.
As our Principal SRE, you'll combine deep technical expertise with strategic vision to build world-class monitoring, observability, and automation systems. You'll have the autonomy to define our SRE processes, select technologies, and create the framework that ensures our systems are reliable, scalable, and performant.
Location: Remote (US-based)
Build SRE practices from scratch - define SLIs, SLOs, error budgets, and reliability metrics
Establish incident response procedures, on-call rotations, and post-mortem processes
Create reliability engineering standards and best practices across all engineering teams
Develop disaster recovery and business continuity strategies
Design and implement capacity planning and performance optimization frameworks
Drive architecture decisions for comprehensive application and infrastructure monitoring solutions
Design and develop custom SRE tools for automated monitoring, alerting, and remediation
Build observability platforms that provide deep insights into system performance and user experience
Create automation frameworks for deployment, scaling, and incident response
Architect logging, metrics, and tracing systems for distributed microservices environments
Leverage Google Cloud Platform services to build resilient, scalable infrastructure
Implement cloud-native monitoring using Cloud Monitoring and Cloud Logging (formerly Stackdriver)
Design auto-scaling and self-healing systems using GKE, Cloud Functions, and managed services
Optimize cloud costs while maintaining high availability and performance standards
Establish security and compliance frameworks within GCP environments
Research and implement cutting-edge SRE tools and methodologies
Leverage AI and machine learning for predictive analytics, anomaly detection, and automated remediation
Create dashboards and reporting systems that provide actionable insights to engineering and business teams
Establish feedback loops for continuous improvement of reliability and performance
Stay current with industry best practices and emerging technologies in the SRE space
12+ years of experience in Site Reliability Engineering or Infrastructure Engineering
5+ years in lead SRE roles building and scaling SRE teams and processes
Proven track record designing and implementing monitoring and observability solutions at scale
Deep understanding of distributed systems, microservices architectures, and cloud-native patterns
Experience with infrastructure as code, configuration management, and deployment automation
Hands-on experience with Google Cloud Platform is required
Expertise with GCP monitoring and observability stack (Cloud Monitoring, Cloud Logging, Cloud Trace)
Experience with GKE, Compute Engine, Cloud Functions, and other core GCP services
Knowledge of GCP networking, security, and compliance capabilities
Understanding of GCP cost optimization and resource management
Strong programming skills in Python, Go, Java, or similar languages
Experience with monitoring tools (Prometheus, Grafana, Datadog, New Relic, or similar)
Proficiency with containerization (Docker, Kubernetes) and orchestration platforms
Knowledge of CI/CD pipelines, automated testing, and deployment strategies
Understanding of database performance tuning and optimization (SQL and NoSQL)
Familiarity with AI-driven development tools and methodologies is a huge plus
Experience with machine learning for operations (AIOps), anomaly detection, or predictive analytics
Knowledge of automated incident response and self-healing systems
Understanding of AI/ML tools for log analysis, pattern recognition, and intelligent alerting
Strong analytical and troubleshooting skills for complex distributed systems
Experience with high-pressure incident response and crisis management
Detail-oriented with commitment to operational excellence and continuous improvement
Comfortable with ambiguity and building processes in a fast-growing environment
Passion for reliability, automation, and engineering best practices
Demonstrated experience building SRE programs and processes from the ground up is a HUGE plus
Bachelor's degree in Computer Science, Engineering, or equivalent professional experience
Industry certifications preferred (Google Cloud Professional, SRE, or related)
Ground-floor opportunity to build SRE practices and culture from scratch
Full autonomy to define processes, select technologies, and establish best practices
Direct impact on platform reliability serving millions of users
Opportunity to create lasting engineering culture and operational excellence
Remote-first culture with in-person meetings in Dallas, TX, on an as-needed basis
Collaborative environment with smart, passionate engineers and cross-functional teams
Access to cutting-edge technologies and AI-driven development tools
Competitive compensation, equity participation, and comprehensive benefits
Join us in creating the SRE foundation that will power InfiniteChoice's next phase of growth. If you're passionate about reliability engineering, love building systems from scratch, and want to establish the operational excellence that scales with our business, we'd love to hear from you.
About InfiniteChoice
InfiniteChoice was founded to help people find the experiences they want simply and effortlessly. We leverage a new type of business model and platform that uniquely applies automation and technology to solve the challenges of scale and complexity in experience discovery.
Existing business and marketing technologies can no longer handle the demands of connecting millions of consumers with vast inventories of experiences across a fragmented, global marketplace of people, partners, and providers.
Our mission is to disrupt this status quo by creating seamless connections between consumers and experiences. We're just at the beginning of this journey, but our approach is working: we've helped over 275 million visitors connect to millions of experiences, generating over $2 billion in revenue for our brands and partners.