HPC Observability Engineer

e-IT Professionals Corp.

Posted 22 hours ago

Role: HPC Observability Engineer (Python, HPC)
Location: Remote (contract)

Description:
The client has Grafana and InfluxDB services running on K8S in-house, on-premises. Telegraf is used to ingest data from a GPU HPC cluster into InfluxDB. This engineer will help collect and visualize data for the “Terra” platform.

The HPC Observability Engineer should have experience in:
  • Setting up and maintaining Grafana dashboards for HPC environments
  • Creating drill-down dashboards for servers, including metrics like memory, network, and CPU utilization
  • Exploring and utilizing out-of-the-box metrics from InfluxDB
  • Writing Python scripts for data ingestion into InfluxDB with examples
  • Developing a proof of concept with a simple Python script to monitor load
  • Ingesting InfiniBand packet data
  • Monitoring LSF jobs in various states
  • Visualizing server-specific and cluster-wide metrics in Grafana
  • Optional: Integrating third-party plugins like DDN’s Lustre, Mellanox fabric, etc.

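
As a rough illustration of the "simple Python script to monitor load" proof of concept mentioned above, the sketch below reads the system load averages and formats them as one InfluxDB line protocol record. The measurement name `hpc_load`, the field names, and the helper functions are hypothetical choices for this example, not anything specified by the client. It only prints the record; one plausible way to ship it, assuming the client's setup allows it, would be Telegraf's `exec` input with `data_format = "influx"`.

```python
import os
import socket
import time


def escape_tag(value: str) -> str:
    """Escape commas, equals signs, and spaces in a line protocol tag value."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def format_load_point(host, load1, load5, load15, ts_ns):
    """Build one line protocol record: measurement,tags fields timestamp.

    'hpc_load' and the field names are hypothetical, chosen for illustration.
    """
    fields = f"load1={float(load1)},load5={float(load5)},load15={float(load15)}"
    return f"hpc_load,host={escape_tag(host)} {fields} {int(ts_ns)}"


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    load1, load5, load15 = os.getloadavg()
    print(format_load_point(socket.gethostname(), load1, load5, load15, time.time_ns()))
```

Keeping the formatting in a pure function separates it from data collection, so the record layout can be checked without a running InfluxDB instance.
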
Qualifications and Skills:
  • B.Tech, MS, or PhD in Computer Science or related field
  • 5-8 years of experience with Grafana, InfluxDB, and Telegraf
  • Experience in Python and Bash scripting is a plus
  • Knowledge of Docker and Google Cloud Platform is advantageous
  • HPC operations experience is beneficial
  • Strong communication skills and ability to work independently
  • Proficiency in requirements analysis and automated testing
  • Ability to write efficient, secure, and well-documented Python code
  • Experience with Git and pipeline development
  • Awareness of modern security and development practices

Responsibilities:
  • Develop and leverage Grafana dashboards and Telegraf configurations
  • Create dashboards for server and cluster metrics
  • Develop Python scripts for data ingestion and documentation
  • Visualize non-native resources in Grafana
  • Optional: Integrate third-party plugins
  • Maintain high-quality code and documentation
  • Collaborate with teams to troubleshoot and optimize pipelines

Desired Skills:
  • Python (good to have)
  • Bash scripting (good to have)
  • Docker (must)
  • HPC operations and LSF (good to have)
  • Experience with DDN Lustre, Mellanox fabric (good to have)
  • Google Cloud Platform (good to have)
  • Knowledge of Git (must)

Seniority level: Mid-Senior level
Employment type: Contract
Job function: Engineering and Information Technology
Industries: IT Services and IT Consulting

Vacancy posted 4 days ago