
Principal Software Engineer - AI Platform (Production Engineering / Reliability)

$144.2k - $288.4k

Oak St. Health

Principal Individual Contributor (IC) for AI Platform

We're building a world of health around every individual — shaping a more connected, convenient and compassionate health experience. At CVS Health®, you'll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold ourselves accountable and prioritize safety and quality in everything we do. Join us and be part of something bigger – helping to simplify health care one person, one family and one community at a time.

We are seeking a Principal Individual Contributor (IC) to lead production engineering, observability, and operational excellence for our AI Platform. This role sits at the intersection of ML systems, distributed infrastructure, and production reliability, ensuring that our AI services are scalable, observable, and resilient in real-world environments.

As a senior technical leader, you will define and drive best-in-class production practices, build robust monitoring and alerting ecosystems, and partner across engineering, ML, and platform teams to ensure mission-critical AI systems meet high availability, performance, and reliability standards.

Key Responsibilities
  • Own and evolve production operations strategy for AI/ML platforms and services
  • Define SLOs, SLIs, and error budgets for AI systems (online inference and batch pipelines)
  • Lead root cause analysis (RCA) and drive systemic improvements post-incident
  • Establish operational readiness standards for launching new AI capabilities
  • Build frameworks for on-call excellence, incident response, and escalation
Observability, Monitoring & Alerting
  • Design and implement end-to-end observability systems across AI workloads:
    • Model performance monitoring
    • Data pipeline health
    • Infrastructure metrics
  • Build and scale monitoring and alerting frameworks using modern tooling (e.g., Prometheus, Grafana, OpenTelemetry, Datadog, Azure Monitor)
  • Define actionable, low-noise alerts tied to business and system impact
  • Develop dashboards and telemetry standards for real-time visibility across services
  • Drive adoption of golden signals (latency, errors, throughput, saturation) in AI systems
AI/ML Production Systems Excellence
  • Ensure reliable deployment and operation of:
    • Real-time inference services
    • Model pipelines (training, validation, deployment)
    • Data ingestion and feature pipelines
  • Implement model observability (drift detection, data skew, performance degradation)
  • Partner with ML engineers to improve production readiness of models
  • Establish lifecycle standards for models in production environments
Automation & Platform Development
  • Build internal platforms and tooling for:
    • Automated incident detection and response
    • Self-healing systems
    • Deployment validation and canarying
  • Drive Infrastructure as Code (IaC) and policy automation
  • Improve system resilience through chaos testing and fault injection
Technical Leadership & Strategy
  • Act as a trusted technical advisor across platform, ML, and product teams
  • Set direction for operational excellence in AI systems at org scale
  • Mentor senior engineers and influence cross-team architectural decisions
  • Lead adoption of industry best practices in reliability engineering and observability
Required Qualifications
  • 10+ years in software engineering, production engineering, or SRE roles
  • Deep experience operating large-scale distributed systems in production
  • Proven track record building monitoring, observability, and alerting systems
  • Strong expertise in incident management and production support models
  • Experience working with cloud platforms (Azure, AWS, GCP)
Preferred Qualifications
  • Experience supporting AI/ML platforms or data-intensive systems
  • Familiarity with model lifecycle management and MLOps practices
  • Knowledge of:
    • OpenTelemetry, Prometheus, Grafana, Datadog
    • Kubernetes and containerized workloads
    • Streaming systems (Kafka, Azure Event Hubs, etc.)
  • Experience defining and implementing SLO-driven engineering
  • Background in high-availability, low-latency systems
Key Competencies
  • Systems thinking and ability to reason about complex, interdependent systems
  • Strong bias for automation, scalability, and long-term solutions
  • Exceptional debugging and incident management skills
  • Ability to influence without authority across multiple teams
  • Passion for operational excellence and reliability

The typical pay range for this role is: $144,200.00 - $288,400.00. This pay range represents the base hourly rate or base annual full-time salary for all positions in the job grade within which this position falls. The actual base salary offer will depend on a variety of factors including experience, education, geography and other relevant factors. This position is eligible for a CVS Health bonus, commission or short-term incentive program in addition to the base pay range listed above. This position also includes an award target in the company's equity award program.

Our people fuel our future. Our teams reflect the customers, patients, members and communities we serve and we are committed to fostering a workplace where every colleague feels valued and that they belong.

Great benefits for great people. We take pride in offering a comprehensive and competitive mix of pay and benefits that reflects our commitment to our colleagues and their families. This full-time position is eligible for a comprehensive benefits package designed to support the physical, emotional, and financial well-being of colleagues and their families. The benefits for this position include medical, dental, and vision coverage, paid time off, retirement savings options, wellness programs, and other resources, based on eligibility.

Vacancy posted 2 days ago