Tech Lead - Network Observability

$180k - $260k

Clockwork.io

Job Description

About Clockwork Systems

Clockwork.io – Software Driven Fabrics to increase GPU cluster utilization

Clockwork Systems was founded by Stanford researchers and veteran systems engineers who share a vision for redefining the foundations of distributed computing. As AI workloads grow increasingly complex, traditional infrastructure struggles to meet the demands of performance, reliability, and precise coordination. Clockwork is pioneering a software-driven approach to AI fabrics by delivering cross-stack observability to catch and quickly resolve problems, workload fault tolerance to keep jobs running through failures, and performance acceleration that dynamically routes and paces traffic to avoid congestion.

To learn more, visit

About the Role

We are seeking an experienced Tech Lead to drive the architecture, development, and scaling of a high-performance network monitoring and observability platform. This role will focus on building systems that provide deep visibility into RDMA, RoCE, InfiniBand, and TCP/IP networks. The ideal candidate has strong experience in distributed systems, Linux networking, and modern observability stacks (e.g., Grafana/Prometheus).

What You Will Do
  • Lead architecture, design, and development of scalable network monitoring platforms for high-performance RDMA, RoCE, InfiniBand, and TCP/IP infrastructure.
  • Build backend telemetry services, observability dashboards, alerts, diagnostics, anomaly detection, SLA monitoring, and traffic analysis workflows.
  • Troubleshoot complex production issues across application, OS, server, RDMA, and network layers while optimizing low-latency collection, aggregation, and alerting.
  • Establish engineering standards, drive automation, define technical roadmaps with cross-functional teams, and mentor engineers on distributed systems and high-performance networking best practices.

What We're Looking For
  • Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field.
  • Strong hands-on programming experience in C++, Go, Python, Rust, or similar systems programming languages.
  • Proven experience leading engineering teams, major technical initiatives, or complex infrastructure projects.
  • Experience building distributed systems, backend services, telemetry pipelines, or observability platforms.
  • Hands-on experience with RDMA, RoCE, InfiniBand, or other high-performance network fabrics.
  • Familiarity with libibverbs, RDMA verbs, RDMA CM, queue pairs, completion queues, memory registration, and related RDMA concepts.
  • Strong knowledge of Linux networking, TCP/IP, DNS, routing, MTU, congestion control, packet loss, latency, and performance tuning.
  • Experience with traceroute-style diagnostics, path discovery, network reachability checks, synthetic probes, or active network measurements.
  • Experience with monitoring and visualization platforms such as Prometheus, Grafana, Datadog, Splunk, OpenTelemetry, or similar tools.
  • Strong debugging skills across software, operating system, server, and network layers.
  • Experience operating production systems in Linux-based environments.
  • Strong architectural judgment and ability to design systems for reliability, scalability, and operational simplicity.

Nice to Have
  • Experience supporting AI/ML, HPC, storage, or GPU cluster infrastructure workloads.
  • Experience with large-scale RoCE or InfiniBand deployments.
  • Experience with NCCL, distributed training infrastructure, or AI cluster diagnostics.
  • Experience with eBPF, XDP, DPDK, perf, tcpdump, Wireshark, ethtool, iproute2, rdma-core, or Linux kernel networking tools.
  • Experience with cloud infrastructure on AWS, GCP, or Azure.
  • Experience with Kubernetes, service discovery, configuration management, and infrastructure automation.
  • Knowledge of security, compliance, and infrastructure best practices.
  • Experience designing time-series data systems, alerting pipelines, or high-cardinality telemetry platforms.

Enjoy

  • Challenging projects.
  • A friendly and inclusive workplace culture.
  • Competitive compensation.
  • A great benefits package.
  • Catered lunch.

Compensation for this position will vary based on the skills and experience you bring, as well as internal equity considerations. For candidates hired at the posted level, the expected base salary range is $180,000 - $260,000. The offered compensation package may also include stock options or other equity awards, subject to Clockwork's equity program and applicable approvals.

In addition to cash compensation, this role is eligible to participate in the company's equity program, which may include stock options granted in accordance with the company's equity plan and subject to approval and applicable vesting schedules.

Clockwork Systems is an equal opportunity employer. We are committed to building world-class teams by welcoming bright, passionate individuals from all backgrounds. All qualified applicants will receive consideration for employment without regard to race, color, ancestry, religion, age, sex, sexual orientation, gender identity or expression, national origin, disability, or protected veteran status. We believe diversity drives innovation, and we grow stronger together.

Vacancy posted a month ago