Principal Platform Engineer, Observability (CIPE)

$147k - $237.5k

Full-time

Palo Alto Networks

Our Mission

At Palo Alto Networks®, we’re united by a shared mission: to protect our digital way of life. We thrive at the intersection of innovation and impact, solving real-world problems with cutting-edge technology and bold thinking. Here, everyone has a voice, and every idea counts. If you’re ready to do the most meaningful work of your career alongside people who are just as passionate as you are, you’re in the right place.

Who We Are

In order to be the cybersecurity partner of choice, we must trailblaze the path and shape the future of our industry. This is something our employees work at each day and is defined by our values: Disruption, Collaboration, Execution, Integrity, and Inclusion. We weave AI into the fabric of everything we do and use it to augment the impact every individual can have. If you are passionate about solving real-world problems and ideating beside the best and the brightest, we invite you to join us!

We believe collaboration thrives in person. That’s why most of our teams work from the office full time, with flexibility when it’s needed. This model supports real-time problem-solving, stronger relationships, and the kind of precision that drives great outcomes.

Job Summary

Your Career

We are looking for a Principal Software Engineer to architect, build, and evolve our observability platform across infrastructure, applications, and developer workflows. This role is ideal for a hands-on technical leader with deep experience in open source observability technologies and Chronosphere, who is equally fluent in building AI-enabled systems and developer experiences using modern AI coding tools such as Claude and Codex. You will serve as a technical architect for the observability stack, working across engineering, platform, SRE, and product teams to define standards for metrics, logs, traces, profiling, synthetics, alerting, dashboards, and incident response.
You will also lead the integration of AI agents, copilots, and skill-based automation into observability workflows, making telemetry, debugging, and reliability operations equally consumable by humans and AI agents. You should be comfortable operating at both strategic and implementation levels: designing architecture, writing production-grade code, reviewing systems, mentoring engineers, and driving adoption across teams.

Your Impact

Observability Architecture

Design and lead the evolution of a modern observability platform using OpenTelemetry, Prometheus, Jaeger, Alertmanager, and related CNCF ecosystem tools.
Define architecture standards for telemetry collection, processing, storage, querying, visualization, alerting, retention, and governance.
Build scalable systems for metrics, distributed tracing, continuous profiling, log aggregation, synthetic monitoring, service health monitoring, and reliability analytics.
Establish best practices for instrumentation across services, infrastructure, Kubernetes workloads, CI/CD systems, and developer platforms.
Evaluate trade-offs around data cardinality, sampling, storage cost, retention, query performance, multi-tenancy, reliability, and operational complexity.
Make pragmatic recommendations on open source, self-managed, managed-service, and hybrid observability approaches.
Create paved-road observability patterns that help engineering teams instrument, monitor, debug, and operate services with minimal friction.

OpenTelemetry and Instrumentation

Lead adoption and standardization of OpenTelemetry across applications, services, infrastructure, and platform components.
Design and implement telemetry pipelines using OpenTelemetry Collector, exporters, processors, receivers, connectors, and custom extensions where needed.
Define conventions for traces, metrics, logs, spans, attributes, resources, service names, correlation IDs, and semantic conventions.
Build libraries, SDK wrappers, golden paths, and internal tooling to simplify observability instrumentation for engineering teams.

Metrics, Monitoring, and Alerting

Architect metrics systems using Prometheus-compatible formats, PromQL, remote write, federation, scraping strategies, service discovery, recording rules, and long-term storage backends.
Design alerting frameworks that reduce noise, improve signal quality, and align with SLOs, SLIs, error budgets, and incident response practices.
Create reusable alerting patterns for Kubernetes, infrastructure, applications, APIs, databases, queues, event-driven systems, and distributed services.
Define standards for dashboarding, runbooks, escalation policies, alert ownership, and production readiness.
Partner with SRE and engineering teams to mature monitoring practices and improve service reliability.

Kubernetes and Platform Engineering

Build observability capabilities for Kubernetes environments, including cluster monitoring, workload telemetry, service mesh visibility, ingress and egress monitoring, and node-level insights.
Develop and maintain Helm charts, Kubernetes manifests, operators, sidecars, agents, DaemonSets, and deployment automation for observability components.
Work with platform teams to ensure observability systems are reliable, secure, multi-tenant, highly available, and easy to operate.
Define standards for resource usage, scaling, upgrades, failover, backup, disaster recovery, access control, and tenant isolation for observability infrastructure.
Support observability across multi-cluster, multi-region, and hybrid cloud environments where applicable.

AI-Enabled Observability and Developer Experience

Design and build AI-enabled observability workflows that allow both humans and AI agents to investigate incidents, query telemetry, summarize signals, and propose remediations.
Define and publish reusable AI skills, agents, and tools (e.g., Claude skills, Codex tools, MCP servers, structured prompts) that encode observability best practices and make platform capabilities consumable by engineering teams and autonomous agents.
Build paved-road AI integrations for triage, alert summarization, root-cause analysis, log/trace exploration, runbook generation, dashboard authoring, and post-incident review.
Establish standards for grounding AI agents in authoritative telemetry, runbooks, and service catalogs, with strong guardrails around accuracy, safety, cost, and auditability.
Use AI coding tools (Claude, Codex, and equivalents) as a first-class part of the engineering workflow for code generation, refactoring, instrumentation rollouts, migrations, and platform automation, and define patterns the broader team can adopt.
Partner with platform, SRE, and product teams to evolve observability from human-only dashboards toward agent-assisted, self-serve reliability operations.

Qualifications

Your Experience

7+ years of software engineering, platform engineering, infrastructure engineering, or SRE experience, with significant experience building production-grade distributed systems.
Deep hands-on experience with observability systems, including metrics, logs, traces, profiling, dashboards, synthetics, alerting, and incident workflows.
Strong expertise with OpenTelemetry, including SDKs, Collector pipelines, exporters, processors, receivers, semantic conventions, and instrumentation patterns.
Strong experience with Prometheus-compatible metrics, Alertmanager, scraping, cardinality management, federation, and remote write patterns.
Hands-on experience with distributed tracing systems such as Jaeger or similar technologies.
Experience with continuous profiling technologies.
Strong experience with synthetic monitoring and proactive availability testing, including API checks, browser-based checks, blackbox monitoring, dependency checks, and integration with alerting and SLO workflows.
Strong Kubernetes experience, including workload monitoring, service discovery, operators/controllers, Helm, resource management, cluster observability, and multi-tenant platform patterns.
Strong Python engineering skills, including building internal tools, automation, integrations, services, and instrumentation libraries.
Hands-on experience building real solutions, tools, and developer workflows using modern AI coding agents such as Claude, Codex, or equivalent, including prompt design, skill/tool/MCP authoring, agent orchestration, and integrating LLMs into production engineering systems.
Practical understanding of how to design AI-friendly platforms: structured APIs, machine-readable runbooks, telemetry schemas, and skills/tools that allow both humans and AI agents to operate observability effectively.
Experience designing and operating high-scale, highly available infrastructure systems.
Strong understanding of SLOs, SLIs, error budgets, incident response, on-call practices, production readiness, and reliability engineering principles.
Experience writing clear technical design documents, RFCs, standards, operational runbooks, and architecture recommendations.
Ability to influence teams through technical depth, collaboration, mentorship, and pragmatic decision-making.

Technical Skills

Observability: OpenTelemetry, Prometheus, Chronosphere, PromQL, Alertmanager, Grafana, Jaeger, OpenTelemetry Collector.
Telemetry: Metrics, logs, traces, spans, profiles, exemplars, service maps, SLOs, SLIs, error budgets, correlation IDs, semantic conventions.
Synthetics: Grafana k6, Prometheus Blackbox Exporter, Playwright, Selenium, API monitoring, browser checks, gRPC checks, DNS/TCP/TLS checks, synthetic user journeys.
Kubernetes: Helm, operators, controllers, CRDs, DaemonSets, sidecars, service discovery, ingress, autoscaling, resource limits, multi-cluster observability.
Programming: Python required; Go, Java, Rust, or Node.js preferred.
AI Engineering: Claude, Codex, and equivalent coding agents; skill/tool/MCP authoring; prompt engineering; agent orchestration; LLM integration patterns; grounding, evaluation, and guardrails for AI-driven workflows.
Infrastructure: Linux, containers, networking, distributed systems, cloud platforms, service mesh, load balancers, APIs, queues, databases.
Automation: CI/CD, GitOps, Terraform, Argo CD, Flux, deployment pipelines, release validation, configuration management.
Reliability: Incident response, alert tuning, runbooks, error budgets, capacity planning, performance optimization, disaster recovery, production readiness.

Success in This Role Looks Like

The organization has a clear, scalable observability architecture with strong standards for telemetry generation, collection, storage, querying, retention, and consumption.
Engineering teams can easily instrument services and get useful metrics, traces, profiles, logs, dashboards, synthetic checks, and alerts without deep observability expertise.
Alerting becomes more actionable, less noisy, and better aligned with service health, SLOs, and customer impact.
Synthetic monitoring proactively detects failures in critical user journeys, APIs, infrastructure endpoints, and third-party dependencies before customers are significantly impacted.
The observability platform is reliable, cost-efficient, secure, multi-tenant, and easy to operate across Kubernetes environments.
Continuous profiling and tracing become part of normal performance, debugging, and reliability workflows.
AI agents and skills are first-class consumers of the observability platform, accelerating triage, investigation, and remediation for both humans and autonomous workflows, with measurable improvements in MTTR and engineer productivity.
The Principal Engineer is recognized as the technical leader who can connect architecture, implementation, operational excellence, developer experience, AI-enabled workflows, and business reliability outcomes across the observability stack.

Compensation Disclosure

The compensation offered for this position will depend on qualifications, experience, and work location. For candidates who receive an offer at the posted level, the starting base salary (for non-sales roles) or base salary + commission target (for sales/commissioned roles) is expected to be the annual range listed below. The offered compensation may also include restricted stock units and a bonus. A description of our employee benefits may be found here.

$147,000.00 - $237,500.00/yr

Our Commitment

We’re trailblazers that dream big, take risks, and challenge cybersecurity’s status quo. It’s simple: we can’t accomplish our mission without diverse teams innovating, together.

We are committed to providing reasonable accommodations for all qualified individuals with a disability. If you require assistance or accommodation due to a disability or special need, please contact us.

Palo Alto Networks is an equal opportunity employer. We celebrate diversity in our workplace, and all qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or other legally protected characteristics.
All your information will be kept confidential according to EEO guidelines.

Is this role eligible for immigration sponsorship? No. Please note that we will not sponsor applicants for work visas for this position.

Vacancy posted 1 day ago
