Principal Software Engineer - AI Platform (Production Engineering / Reliability)
$144.2k - $288.4k
Oak St. Health
Principal Individual Contributor (IC) for AI Platform
We're building a world of health around every individual — shaping a more connected, convenient and compassionate health experience. At CVS Health®, you'll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold ourselves accountable and prioritize safety and quality in everything we do. Join us and be part of something bigger – helping to simplify health care one person, one family and one community at a time.
We are seeking a Principal Individual Contributor (IC) to lead production engineering, observability, and operational excellence for our AI Platform. This role sits at the intersection of ML systems, distributed infrastructure, and production reliability, ensuring that our AI services are scalable, observable, and resilient in real-world environments.
As a senior technical leader, you will define and drive best-in-class production practices, build robust monitoring and alerting ecosystems, and partner across engineering, ML, and platform teams to ensure mission-critical AI systems meet high availability, performance, and reliability standards.
Key Responsibilities
- Own and evolve production operations strategy for AI/ML platforms and services
- Define SLOs, SLIs, and error budgets for AI systems (online & batch/inference pipelines)
- Lead root cause analysis (RCA) and drive systemic improvements post-incident
- Establish operational readiness standards for launching new AI capabilities
- Build frameworks for on-call excellence, incident response, and escalation
Observability, Monitoring & Alerting
- Design and implement end-to-end observability systems across AI workloads:
  - Model performance monitoring
  - Data pipeline health
  - Infrastructure metrics
- Build and scale monitoring and alerting frameworks using modern tooling (e.g., Prometheus, Grafana, OpenTelemetry, Datadog, Azure Monitor)
- Define actionable, low-noise alerts tied to business and system impact
- Develop dashboards and telemetry standards for real-time visibility across services
- Drive adoption of golden signals (latency, errors, throughput, saturation) in AI systems
AI/ML Production Systems Excellence
- Ensure reliable deployment and operation of:
  - Real-time inference services
  - Model pipelines (training, validation, deployment)
  - Data ingestion and feature pipelines
- Implement model observability (drift detection, data skew, performance degradation)
- Partner with ML engineers to improve production readiness of models
- Establish lifecycle standards for models in production environments
Automation & Platform Development
- Build internal platforms and tooling for:
  - Automated incident detection and response
  - Self-healing systems
  - Deployment validation and canarying
- Drive Infrastructure as Code (IaC) and policy automation
- Improve system resilience through chaos testing and fault injection
Technical Leadership & Strategy
- Act as a trusted technical advisor across platform, ML, and product teams
- Set direction for operational excellence in AI systems at org scale
- Mentor senior engineers and influence cross-team architectural decisions
- Lead adoption of industry best practices in reliability engineering and observability
Required Qualifications
- 10+ years in software engineering, production engineering, or SRE roles
- Deep experience operating large-scale distributed systems in production
- Proven track record building monitoring, observability, and alerting systems
- Strong expertise in incident management and production support models
- Experience working with cloud platforms (Azure, AWS, GCP)
Preferred Qualifications
- Experience supporting AI/ML platforms or data-intensive systems
- Familiarity with model lifecycle management and MLOps practices
- Knowledge of:
  - OpenTelemetry, Prometheus, Grafana, Datadog
  - Kubernetes and containerized workloads
  - Streaming systems (Kafka, Event Hub, etc.)
- Experience defining and implementing SLO-driven engineering
- Background in high-availability, low-latency systems
Key Competencies
- Systems thinking and ability to reason about complex, interdependent systems
- Strong bias for automation, scalability, and long-term solutions
- Exceptional debugging and incident management skills
- Ability to influence without authority across multiple teams
- Passion for operational excellence and reliability
The typical pay range for this role is: $144,200.00 - $288,400.00. This pay range represents the base hourly rate or base annual full-time salary for all positions in the job grade within which this position falls. The actual base salary offer will depend on a variety of factors including experience, education, geography and other relevant factors. This position is eligible for a CVS Health bonus, commission or short-term incentive program in addition to the base pay range listed above. This position also includes an award target in the company's equity award program.
Our people fuel our future. Our teams reflect the customers, patients, members and communities we serve and we are committed to fostering a workplace where every colleague feels valued and that they belong.
Great benefits for great people. We take pride in offering a comprehensive and competitive mix of pay and benefits that reflects our commitment to our colleagues and their families. This full-time position is eligible for a comprehensive benefits package designed to support the physical, emotional, and financial well-being of colleagues and their families. The benefits for this position include medical, dental, and vision coverage, paid time off, retirement savings options, wellness programs, and other resources, based on eligibility.