Principal Site Reliability Engineer (SRE)

INFINITE CHOICE LLC

Job Description

About the Role

We're seeking an exceptional Principal Site Reliability Engineer to architect, design, and build our SRE foundation from the ground up at InfiniteChoice. This is a rare greenfield opportunity to establish SRE practices, develop custom tooling, and create the reliability culture that will support our platform serving millions of users and billions in transaction volume.

As our Principal SRE, you'll combine deep technical expertise with strategic vision to build world-class monitoring, observability, and automation systems. You'll have the autonomy to define our SRE processes, select technologies, and create the framework that ensures our systems are reliable, scalable, and performant.

Location: Remote (US-based)

What You Will Do

SRE Foundation & Process Development

Build SRE practices from scratch - define SLIs, SLOs, error budgets, and reliability metrics
Establish incident response procedures, on-call rotations, and post-mortem processes
Create reliability engineering standards and best practices across all engineering teams
Develop disaster recovery and business continuity strategies
Design and implement capacity planning and performance optimization frameworks

Architecture & Tool Development

Drive architecture decisions for comprehensive application and infrastructure monitoring solutions
Design and develop custom SRE tools for automated monitoring, alerting, and remediation
Build observability platforms that provide deep insights into system performance and user experience
Create automation frameworks for deployment, scaling, and incident response
Architect logging, metrics, and tracing systems for distributed microservices environments

Google Cloud Infrastructure Excellence

Leverage Google Cloud Platform services to build resilient, scalable infrastructure
Implement cloud-native monitoring using Cloud Monitoring and Cloud Logging (formerly Stackdriver)
Design auto-scaling and self-healing systems using GKE, Cloud Functions, and managed services
Optimize cloud costs while maintaining high availability and performance standards
Establish security and compliance frameworks within GCP environments

Innovation & Continuous Improvement

Research and implement cutting-edge SRE tools and methodologies
Leverage AI and machine learning for predictive analytics, anomaly detection, and automated remediation
Create dashboards and reporting systems that provide actionable insights to engineering and business teams
Establish feedback loops for continuous improvement of reliability and performance
Stay current with industry best practices and emerging technologies in the SRE space

What You Must Have

SRE & Infrastructure Expertise

12+ years of experience in Site Reliability Engineering or Infrastructure Engineering
5+ years in lead SRE roles building and scaling SRE teams and processes
Proven track record designing and implementing monitoring and observability solutions at scale
Deep understanding of distributed systems, microservices architectures, and cloud-native patterns
Experience with infrastructure as code, configuration management, and deployment automation

Google Cloud Platform Proficiency

Hands-on experience with Google Cloud Platform is required
Expertise with GCP monitoring and observability stack (Cloud Monitoring, Cloud Logging, Cloud Trace)
Experience with GKE, Compute Engine, Cloud Functions, and other core GCP services
Knowledge of GCP networking, security, and compliance capabilities
Understanding of GCP cost optimization and resource management

Technical Skills

Strong programming skills in Python, Go, Java, or similar languages
Experience with monitoring tools (Prometheus, Grafana, Datadog, New Relic, or similar)
Proficiency with containerization and orchestration platforms (Docker, Kubernetes)
Knowledge of CI/CD pipelines, automated testing, and deployment strategies
Understanding of database performance tuning and optimization (SQL and NoSQL)

AI & Automation

Familiarity with AI-driven development tools and methodologies is a huge plus
Experience with machine learning for operations (AIOps), anomaly detection, or predictive analytics
Knowledge of automated incident response and self-healing systems
Understanding of AI/ML tools for log analysis, pattern recognition, and intelligent alerting

Problem-Solving & Mindset

Strong analytical and troubleshooting skills for complex distributed systems
Experience with high-pressure incident response and crisis management
Detail-oriented, with a commitment to operational excellence and continuous improvement
Comfortable with ambiguity and building processes in a fast-growing environment
Passion for reliability, automation, and engineering best practices
Demonstrated experience building SRE programs and processes from the ground up is a HUGE plus

Education

Bachelor's degree in Computer Science, Engineering, or equivalent professional experience
Industry certifications preferred (Google Cloud Professional, SRE, or related)

What We Offer

Ground-floor opportunity to build SRE practices and culture from scratch
Full autonomy to define processes, select technologies, and establish best practices
Direct impact on platform reliability serving millions of users
Opportunity to create lasting engineering culture and operational excellence
Remote-first culture with in-person meetings in Dallas, TX as needed
Collaborative environment with smart, passionate engineers and cross-functional teams
Access to cutting-edge technologies and AI-driven development tools
Competitive compensation, equity participation, and comprehensive benefits

Ready to Build World-Class Reliability?

Join us in creating the SRE foundation that will power InfiniteChoice's next phase of growth. If you're passionate about reliability engineering, love building systems from scratch, and want to establish the operational excellence that scales with our business, we'd love to hear from you.

About InfiniteChoice

InfiniteChoice was founded to help people find the experiences they want simply and effortlessly. We leverage a new type of business model and platform that uniquely applies automation and technology to solve the challenges of scale and complexity in experience discovery.

Existing business and marketing technologies can no longer handle the demands of connecting millions of consumers with vast inventories of experiences across a fragmented, global marketplace of people, partners, and providers.

Our mission is to disrupt this status quo by creating seamless connections between consumers and experiences. We're just at the beginning of this journey, but our approach is working: we've helped over 275 million visitors connect to millions of experiences, generating over $2 billion in revenue for our brands and partners.

Vacancy posted more than 2 months ago
