Tech Lead - Network Observability

$180k - $260k

Clockwork Inc

About Clockwork Systems

Clockwork.io - Software-Driven Fabrics to increase GPU cluster utilization

Clockwork Systems was founded by Stanford researchers and veteran systems engineers who share a vision for redefining the foundations of distributed computing. As AI workloads grow increasingly complex, traditional infrastructure struggles to meet the demands of performance, reliability, and precise coordination. Clockwork is pioneering a software-driven approach to AI fabrics by delivering cross-stack observability to catch and quickly resolve problems, workload fault tolerance to keep jobs running through failures, and performance acceleration that dynamically routes and paces traffic to avoid congestion.
To learn more, visit

About the Role

We are seeking an experienced Tech Lead to own the architecture, development, and scaling of a high-performance network monitoring and observability platform. This role will focus on building systems that provide deep visibility into RDMA, RoCE, InfiniBand, and TCP/IP networks. The ideal candidate has strong experience in distributed systems, Linux networking, and modern observability stacks (e.g., Grafana/Prometheus).

What You Will Do
  • Lead architecture, design, and development of scalable network monitoring platforms for high-performance RDMA, RoCE, InfiniBand, and TCP/IP infrastructure.
  • Build backend telemetry services, observability dashboards, alerts, diagnostics, anomaly detection, SLA monitoring, and traffic analysis workflows.
  • Troubleshoot complex production issues across application, OS, server, RDMA, and network layers while optimizing low-latency collection, aggregation, and alerting.
  • Establish engineering standards, drive automation, define technical roadmaps with cross-functional teams, and mentor engineers on distributed systems and high-performance networking best practices.

What We're Looking For
  • Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field.
  • Strong hands-on programming experience in C++, Go, Python, Rust, or similar systems programming languages.
  • Proven experience leading engineering teams, major technical initiatives, or complex infrastructure projects.
  • Experience building distributed systems, backend services, telemetry pipelines, or observability platforms.
  • Hands-on experience with RDMA, RoCE, InfiniBand, or other high-performance network fabrics.
  • Familiarity with libibverbs, RDMA verbs, RDMA CM, queue pairs, completion queues, memory registration, and related RDMA concepts.
  • Strong knowledge of Linux networking, TCP/IP, DNS, routing, MTU, congestion control, packet loss, latency, and performance tuning.
  • Experience with traceroute-style diagnostics, path discovery, network reachability checks, synthetic probes, or active network measurements.
  • Experience with monitoring and visualization platforms such as Prometheus, Grafana, Datadog, Splunk, OpenTelemetry, or similar tools.
  • Strong debugging skills across software, operating system, server, and network layers.
  • Experience operating production systems in Linux-based environments.
  • Strong architectural judgment and ability to design systems for reliability, scalability, and operational simplicity.

Nice to Have
  • Experience supporting AI/ML, HPC, storage, or GPU cluster infrastructure workloads.
  • Experience with large-scale RoCE or InfiniBand deployments.
  • Experience with NCCL, distributed training infrastructure, or AI cluster diagnostics.
  • Experience with eBPF, XDP, DPDK, perf, tcpdump, Wireshark, ethtool, iproute2, rdma-core, or Linux kernel networking tools.
  • Experience with cloud infrastructure on AWS, GCP, or Azure.
  • Experience with Kubernetes, service discovery, configuration management, and infrastructure automation.
  • Knowledge of security, compliance, and infrastructure best practices.
  • Experience designing time-series data systems, alerting pipelines, or high-cardinality telemetry platforms.

Enjoy
  • Challenging projects.
  • A friendly and inclusive workplace culture.
  • Competitive compensation.
  • A great benefits package.
  • Catered lunch.

Compensation for this position will vary based on the skills and experience you bring, as well as internal equity considerations. For candidates hired at the posted level, the expected base salary range is $180,000 - $260,000. The offered compensation package may also include stock options or other equity awards, subject to Clockwork's equity program and applicable approvals.

In addition to cash compensation, this role is eligible to participate in the company's equity program, which may include stock options granted in accordance with the company's equity plan and subject to approval and applicable vesting schedules.

Clockwork Systems is an equal opportunity employer. We are committed to building world-class teams by welcoming bright, passionate individuals from all backgrounds. All qualified applicants will receive consideration for employment without regard to race, color, ancestry, religion, age, sex, sexual orientation, gender identity or expression, national origin, disability, or protected veteran status. We believe diversity drives innovation, and we grow stronger together.
Vacancy posted 6 hours ago
Palo Alto, CA