Tech Lead - Network Observability

$180k - $260k

Clockwork Inc

About Clockwork Systems

Clockwork.io - Software Driven Fabrics to increase GPU cluster utilization

Clockwork Systems was founded by Stanford researchers and veteran systems engineers who share a vision for redefining the foundations of distributed computing. As AI workloads grow increasingly complex, traditional infrastructure struggles to meet the demands of performance, reliability, and precise coordination. Clockwork is pioneering a software-driven approach to AI fabrics by delivering cross-stack observability to catch and quickly resolve problems, workload fault tolerance to keep jobs running through failures, and performance acceleration that dynamically routes and paces traffic to avoid congestion.
To learn more, visit

About the Role

We are seeking an experienced Tech Lead to lead the architecture, development, and scaling of a high-performance network monitoring and observability platform. This role will focus on building systems that provide deep visibility into RDMA, RoCE, InfiniBand, and TCP/IP networks. The ideal candidate has strong experience in distributed systems, Linux networking, and modern observability stacks (e.g., Grafana/Prometheus).

What You Will Do

  • Lead architecture, design, and development of scalable network monitoring platforms for high-performance RDMA, RoCE, InfiniBand, and TCP/IP infrastructure.
  • Build backend telemetry services, observability dashboards, alerts, diagnostics, anomaly detection, SLA monitoring, and traffic analysis workflows.
  • Troubleshoot complex production issues across application, OS, server, RDMA, and network layers while optimizing low-latency collection, aggregation, and alerting.
  • Establish engineering standards, drive automation, define technical roadmaps with cross-functional teams, and mentor engineers on distributed systems and high-performance networking best practices.

What We're Looking For

  • Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field.
  • Strong hands-on programming experience in C++, Go, Python, Rust, or similar systems programming languages.
  • Proven experience leading engineering teams, major technical initiatives, or complex infrastructure projects.
  • Experience building distributed systems, backend services, telemetry pipelines, or observability platforms.
  • Hands-on experience with RDMA, RoCE, InfiniBand, or other high-performance network fabrics.
  • Familiarity with libibverbs, RDMA verbs, RDMA CM, queue pairs, completion queues, memory registration, and related RDMA concepts.
  • Strong knowledge of Linux networking, TCP/IP, DNS, routing, MTU, congestion control, packet loss, latency, and performance tuning.
  • Experience with traceroute-style diagnostics, path discovery, network reachability checks, synthetic probes, or active network measurements.
  • Experience with monitoring and visualization platforms such as Prometheus, Grafana, Datadog, Splunk, OpenTelemetry, or similar tools.
  • Strong debugging skills across software, operating system, server, and network layers.
  • Experience operating production systems in Linux-based environments.
  • Strong architectural judgment and ability to design systems for reliability, scalability, and operational simplicity.

Nice to Have

  • Experience supporting AI/ML, HPC, storage, or GPU cluster infrastructure workloads.
  • Experience with large-scale RoCE or InfiniBand deployments.
  • Experience with NCCL, distributed training infrastructure, or AI cluster diagnostics.
  • Experience with eBPF, XDP, DPDK, perf, tcpdump, Wireshark, ethtool, iproute2, rdma-core, or Linux kernel networking tools.
  • Experience with cloud infrastructure on AWS, GCP, or Azure.
  • Experience with Kubernetes, service discovery, configuration management, and infrastructure automation.
  • Knowledge of security, compliance, and infrastructure best practices.
  • Experience designing time-series data systems, alerting pipelines, or high-cardinality telemetry platforms.

Enjoy

  • Challenging projects.
  • A friendly and inclusive workplace culture.
  • Competitive compensation.
  • A great benefits package.
  • Catered lunch.

Compensation for this position will vary based on the skills and experience you bring, as well as internal equity considerations. For candidates hired at the posted level, the expected base salary range is $180,000 - $260,000. The offered compensation package may also include stock options or other equity awards, subject to Clockwork's equity program and applicable approvals.

In addition to cash compensation, this role is eligible to participate in the company's equity program, which may include stock options granted in accordance with the company's equity plan and subject to approval and applicable vesting schedules.

Clockwork Systems is an equal opportunity employer. We are committed to building world-class teams by welcoming bright, passionate individuals from all backgrounds. All qualified applicants will receive consideration for employment without regard to race, color, ancestry, religion, age, sex, sexual orientation, gender identity or expression, national origin, disability, or protected veteran status. We believe diversity drives innovation, and we grow stronger together.
Vacancy posted 4 days ago