Principal Software Engineer - AI Platform (Production Engineering / Reliability)
$144.2k - $288.4k
CVS Health
We're building a world of health around every individual - shaping a more connected, convenient and compassionate health experience. At CVS Health®, you'll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold themselves accountable and prioritize safety and quality in everything they do. Join us and be part of something bigger - helping to simplify health care one person, one family and one community at a time.
Overview
We are seeking a Principal Individual Contributor (IC) to lead production engineering, observability, and operational excellence for our AI Platform. This role sits at the intersection of ML systems, distributed infrastructure, and production reliability, ensuring that our AI services are scalable, observable, and resilient in real-world environments.
As a senior technical leader, you will define and drive best-in-class production practices, build robust monitoring and alerting ecosystems, and partner across engineering, ML, and platform teams to ensure mission-critical AI systems meet high availability, performance, and reliability standards.
Key Responsibilities
Production Reliability & Operations Leadership
- Own and evolve production operations strategy for AI/ML platforms and services
- Define SLOs, SLIs, and error budgets for AI systems (online & batch/inference pipelines)
- Lead root cause analysis (RCA) and drive systemic improvements post-incident
- Establish operational readiness standards for launching new AI capabilities
- Build frameworks for on-call excellence, incident response, and escalation
- Design and implement end-to-end observability systems across AI workloads:
  - Model performance monitoring
  - Data pipeline health
  - Infrastructure metrics
- Build and scale monitoring and alerting frameworks using modern tooling (e.g., Prometheus, Grafana, OpenTelemetry, Datadog, Azure Monitor)
- Define actionable, low-noise alerts tied to business and system impact
- Develop dashboards and telemetry standards for real-time visibility across services
- Drive adoption of golden signals (latency, errors, throughput, saturation) in AI systems
- Ensure reliable deployment and operation of:
  - Real-time inference services
  - Model pipelines (training, validation, deployment)
  - Data ingestion and feature pipelines
- Implement model observability (drift detection, data skew, performance degradation)
- Partner with ML engineers to improve production readiness of models
- Establish lifecycle standards for models in production environments
- Build internal platforms and tooling for:
  - Automated incident detection and response
  - Self-healing systems
  - Deployment validation and canarying
- Drive Infrastructure as Code (IaC) and policy automation
- Improve system resilience through chaos testing and fault injection
- Act as a trusted technical advisor across platform, ML, and product teams
- Set direction for operational excellence in AI systems at org scale
- Mentor senior engineers and influence cross-team architectural decisions
- Lead adoption of industry best practices in reliability engineering and observability
Qualifications
- 10+ years in software engineering, production engineering, or SRE roles
- Deep experience operating large-scale distributed systems in production
- Proven track record building monitoring, observability, and alerting systems
- Strong expertise in incident management and production support models
- Experience working with cloud platforms (Azure, AWS, GCP)
- Experience supporting AI/ML platforms or data-intensive systems
- Familiarity with model lifecycle management and MLOps practices
- Knowledge of:
  - OpenTelemetry, Prometheus, Grafana, Datadog
  - Kubernetes and containerized workloads
  - Streaming systems (Kafka, Event Hub, etc.)
- Experience defining and implementing SLO-driven engineering
- Background in high-availability, low-latency systems
- Systems thinking and ability to reason about complex, interdependent systems
- Strong bias for automation, scalability, and long-term solutions
- Exceptional debugging and incident management skills
- Ability to influence without authority across multiple teams
- Passion for operational excellence and reliability
Our people fuel our future. Our teams reflect the customers, patients, members and communities we serve and we are committed to fostering a workplace where every colleague feels valued and that they belong.
Great benefits for great people
We take pride in offering a comprehensive and competitive mix of pay and benefits that reflects our commitment to our colleagues and their families. This full-time position is eligible for a comprehensive benefits package designed to support the physical, emotional, and financial well-being of colleagues and their families. The benefits for this position include medical, dental, and vision coverage, paid time off, retirement savings options, wellness programs, and other resources, based on eligibility. Additional details about available benefits are provided during the application process and on Benefits Moments.
We anticipate the application window for this opening will close on: 06/04/2026
Qualified applicants with arrest or conviction records will be considered for employment in accordance with all federal, state and local laws.


