Principal Platform Engineer, Observability (CIPE)

$147k - $237.5k
Full-time

Palo Alto Networks

Our Mission

At Palo Alto Networks®, we’re united by a shared mission: to protect our digital way of life. We thrive at the intersection of innovation and impact, solving real-world problems with cutting-edge technology and bold thinking. Here, everyone has a voice, and every idea counts. If you’re ready to do the most meaningful work of your career alongside people who are just as passionate as you are, you’re in the right place.

Who We Are

In order to be the cybersecurity partner of choice, we must trailblaze the path and shape the future of our industry. This is something our employees work at each day and is defined by our values: Disruption, Collaboration, Execution, Integrity, and Inclusion. We weave AI into the fabric of everything we do and use it to augment the impact every individual can have. If you are passionate about solving real-world problems and ideating beside the best and the brightest, we invite you to join us!

We believe collaboration thrives in person. That’s why most of our teams work from the office full time, with flexibility when it’s needed. This model supports real-time problem-solving, stronger relationships, and the kind of precision that drives great outcomes.

Job Summary

Your Career

We are looking for a Principal Software Engineer to architect, build, and evolve our observability platform across infrastructure, applications, and developer workflows. This role is ideal for a hands-on technical leader with deep experience in open source observability technologies and Chronosphere, who is equally fluent in building AI-enabled systems and developer experiences using modern AI coding tools such as Claude and Codex.

You will serve as a technical architect for the observability stack, working across engineering, platform, SRE, and product teams to define standards for metrics, logs, traces, profiling, synthetics, alerting, dashboards, and incident response.
You will also lead the integration of AI agents, copilots, and skill-based automation into observability workflows, making telemetry, debugging, and reliability operations equally consumable by humans and AI agents. You should be comfortable operating at both strategic and implementation levels: designing architecture, writing production-grade code, reviewing systems, mentoring engineers, and driving adoption across teams.

Your Impact

Observability Architecture
  • Design and lead the evolution of a modern observability platform using OpenTelemetry, Prometheus, Jaeger, Alertmanager, and related CNCF ecosystem tools.
  • Define architecture standards for telemetry collection, processing, storage, querying, visualization, alerting, retention, and governance.
  • Build scalable systems for metrics, distributed tracing, continuous profiling, log aggregation, synthetic monitoring, service health monitoring, and reliability analytics.
  • Establish best practices for instrumentation across services, infrastructure, Kubernetes workloads, CI/CD systems, and developer platforms.
  • Evaluate trade-offs around data cardinality, sampling, storage cost, retention, query performance, multi-tenancy, reliability, and operational complexity.
  • Make pragmatic recommendations on open source, self-managed, managed-service, and hybrid observability approaches.
  • Create paved-road observability patterns that help engineering teams instrument, monitor, debug, and operate services with minimal friction.

OpenTelemetry and Instrumentation
  • Lead adoption and standardization of OpenTelemetry across applications, services, infrastructure, and platform components.
  • Design and implement telemetry pipelines using OpenTelemetry Collector, exporters, processors, receivers, connectors, and custom extensions where needed.
  • Define conventions for traces, metrics, logs, spans, attributes, resources, service names, correlation IDs, and semantic conventions.
  • Build libraries, SDK wrappers, golden paths, and internal tooling to simplify observability instrumentation for engineering teams.

Metrics, Monitoring, and Alerting
  • Architect metrics systems using Prometheus-compatible formats, PromQL, remote write, federation, scraping strategies, service discovery, recording rules, and long-term storage backends.
  • Design alerting frameworks that reduce noise, improve signal quality, and align with SLOs, SLIs, error budgets, and incident response practices.
  • Create reusable alerting patterns for Kubernetes, infrastructure, applications, APIs, databases, queues, event-driven systems, and distributed services.
  • Define standards for dashboarding, runbooks, escalation policies, alert ownership, and production readiness.
  • Partner with SRE and engineering teams to mature monitoring practices and improve service reliability.

Kubernetes and Platform Engineering
  • Build observability capabilities for Kubernetes environments, including cluster monitoring, workload telemetry, service mesh visibility, ingress and egress monitoring, and node-level insights.
  • Develop and maintain Helm charts, Kubernetes manifests, operators, sidecars, agents, DaemonSets, and deployment automation for observability components.
  • Work with platform teams to ensure observability systems are reliable, secure, multi-tenant, highly available, and easy to operate.
  • Define standards for resource usage, scaling, upgrades, failover, backup, disaster recovery, access control, and tenant isolation for observability infrastructure.
  • Support observability across multi-cluster, multi-region, and hybrid cloud environments where applicable.

AI-Enabled Observability and Developer Experience
  • Design and build AI-enabled observability workflows that allow both humans and AI agents to investigate incidents, query telemetry, summarize signals, and propose remediations.
  • Define and publish reusable AI skills, agents, and tools (e.g., Claude skills, Codex tools, MCP servers, structured prompts) that encode observability best practices and make platform capabilities consumable by engineering teams and autonomous agents.
  • Build paved-road AI integrations for triage, alert summarization, root-cause analysis, log/trace exploration, runbook generation, dashboard authoring, and post-incident review.
  • Establish standards for grounding AI agents in authoritative telemetry, runbooks, and service catalogs, with strong guardrails around accuracy, safety, cost, and auditability.
  • Use AI coding tools (Claude, Codex, and equivalents) as a first-class part of the engineering workflow, for code generation, refactoring, instrumentation rollouts, migrations, and platform automation, and define patterns the broader team can adopt.
  • Partner with platform, SRE, and product teams to evolve observability from human-only dashboards toward agent-assisted, self-serve reliability operations.

Qualifications

Your Experience
  • 7+ years of software engineering, platform engineering, infrastructure engineering, or SRE experience, with significant experience building production-grade distributed systems.
  • Deep hands-on experience with observability systems, including metrics, logs, traces, profiling, dashboards, synthetics, alerting, and incident workflows.
  • Strong expertise with OpenTelemetry, including SDKs, Collector pipelines, exporters, processors, receivers, semantic conventions, and instrumentation patterns.
  • Strong experience with Prometheus-compatible metrics, Alertmanager, scraping, cardinality management, federation, and remote write patterns.
  • Hands-on experience with distributed tracing systems such as Jaeger or similar technologies.
  • Experience with continuous profiling technologies.
  • Strong experience with synthetic monitoring and proactive availability testing, including API checks, browser-based checks, blackbox monitoring, dependency checks, and integration with alerting and SLO workflows.
  • Strong Kubernetes experience, including workload monitoring, service discovery, operators/controllers, Helm, resource management, cluster observability, and multi-tenant platform patterns.
  • Strong Python engineering skills, including building internal tools, automation, integrations, services, and instrumentation libraries.
  • Hands-on experience building real solutions, tools, and developer workflows using modern AI coding agents such as Claude, Codex, or equivalent, including prompt design, skill/tool/MCP authoring, agent orchestration, and integrating LLMs into production engineering systems.
  • Practical understanding of how to design AI-friendly platforms: structured APIs, machine-readable runbooks, telemetry schemas, and skills/tools that allow both humans and AI agents to operate observability effectively.
  • Experience designing and operating high-scale, highly available infrastructure systems.
  • Strong understanding of SLOs, SLIs, error budgets, incident response, on-call practices, production readiness, and reliability engineering principles.
  • Experience writing clear technical design documents, RFCs, standards, operational runbooks, and architecture recommendations.
  • Ability to influence teams through technical depth, collaboration, mentorship, and pragmatic decision-making.

Technical Skills
  • Observability: OpenTelemetry, Prometheus, Chronosphere, PromQL, Alertmanager, Grafana, Jaeger, OpenTelemetry Collector.
  • Telemetry: Metrics, logs, traces, spans, profiles, exemplars, service maps, SLOs, SLIs, error budgets, correlation IDs, semantic conventions.
  • Synthetics: Grafana k6, Prometheus Blackbox Exporter, Playwright, Selenium, API monitoring, browser checks, checks, gRPC checks, DNS/TCP/TLS checks, synthetic user journeys.
  • Kubernetes: Helm, operators, controllers, CRDs, DaemonSets, sidecars, service discovery, ingress, autoscaling, resource limits, multi-cluster observability.
  • Programming: Python required; Go, Java, Rust, or Node.js preferred.
  • AI Engineering: Claude, Codex, and equivalent coding agents; skill/tool/MCP authoring; prompt engineering; agent orchestration; LLM integration patterns; grounding, evaluation, and guardrails for AI-driven workflows.
  • Infrastructure: Linux, containers, networking, distributed systems, cloud platforms, service mesh, load balancers, APIs, queues, databases.
  • Automation: CI/CD, GitOps, Terraform, Argo CD, Flux, deployment pipelines, release validation, configuration management.
  • Reliability: Incident response, alert tuning, runbooks, error budgets, capacity planning, performance optimization, disaster recovery, production readiness.

Success in This Role Looks Like
  • The organization has a clear, scalable observability architecture with strong standards for telemetry generation, collection, storage, querying, retention, and consumption.
  • Engineering teams can easily instrument services and get useful metrics, traces, profiles, logs, dashboards, synthetic checks, and alerts without deep observability expertise.
  • Alerting becomes more actionable, less noisy, and better aligned with service health, SLOs, and customer impact.
  • Synthetic monitoring proactively detects failures in critical user journeys, APIs, infrastructure endpoints, and third-party dependencies before customers are significantly impacted.
  • The observability platform is reliable, cost-efficient, secure, multi-tenant, and easy to operate across Kubernetes environments.
  • Continuous profiling and tracing become part of normal performance, debugging, and reliability workflows.
  • AI agents and skills are first-class consumers of the observability platform, accelerating triage, investigation, and remediation for both humans and autonomous workflows, with measurable improvements in MTTR and engineer productivity.
  • The Principal Engineer is recognized as the technical leader who can connect architecture, implementation, operational excellence, developer experience, AI-enabled workflows, and business reliability outcomes across the observability stack.

Compensation Disclosure

The compensation offered for this position will depend on qualifications, experience, and work location. For candidates who receive an offer at the posted level, the starting base salary (for non-sales roles) or base salary + commission target (for sales/commissioned roles) is expected to be the annual range listed below. The offered compensation may also include restricted stock units and a bonus. A description of our employee benefits may be found here.

$147,000.00 - $237,500.00/yr

Our Commitment

We’re trailblazers that dream big, take risks, and challenge cybersecurity’s status quo. It’s simple: we can’t accomplish our mission without diverse teams innovating, together.

We are committed to providing reasonable accommodations for all qualified individuals with a disability. If you require assistance or accommodation due to a disability or special need, please contact us at View email address on click.appcast.io.

Palo Alto Networks is an equal opportunity employer. We celebrate diversity in our workplace, and all qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or other legally protected characteristics.
All your information will be kept confidential according to EEO guidelines.

Is role eligible for Immigration Sponsorship? No. Please note that we will not sponsor applicants for work visas for this position.

Please use this form to provide us with information that will help direct your request and find your data in all of our systems.

Vacancy posted 1 day ago