HPC Observability Engineer
e-IT Professionals Corp.
Posted 22 hours ago

Role: HPC Observability Engineer (Python, HPC)
Location: Remote (contract)

Description:
The client has Grafana and InfluxDB services running on K8S in-house on-premises. Telegraf is used to ingest data from a GPU HPC cluster into InfluxDB. This engineer will help collect and visualize data for the “Terra” platform.

The HPC Observability Engineer should have experience in:
- Setting up and maintaining Grafana dashboards for HPC environments
- Creating drill-down dashboards for servers, including metrics like memory, network, and CPU utilization
- Exploring and utilizing out-of-the-box metrics from InfluxDB
- Writing Python scripts for data ingestion into InfluxDB with examples
- Developing a proof of concept with a simple Python script to monitor load
- Ingesting InfiniBand packet data
- Monitoring LSF jobs in various states
- Visualizing server-specific and cluster-wide metrics in Grafana
- Optional: Integrating third-party plugins like DDN’s Lustre, Mellanox fabric, etc.

Qualifications and Skills:
- B.Tech, MS, or PhD in Computer Science or related field
- 5-8 years of experience with Grafana, InfluxDB, and Telegraf
- Experience in Python and Bash scripting is a plus
- Knowledge of Docker and Google Cloud Platform is advantageous
- HPC operations experience is beneficial
- Strong communication skills and ability to work independently
- Proficiency in requirements analysis and automated testing
- Ability to write efficient, secure, and well-documented Python code
- Experience with Git and pipeline development
- Awareness of modern security and development practices

Responsibilities:
- Develop and leverage Grafana dashboards and Telegraf configurations
- Create dashboards for server and cluster metrics
- Develop Python scripts for data ingestion and documentation
- Visualize non-native resources in Grafana
- Optional: Integrate third-party plugins
- Maintain high-quality code and documentation
- Collaborate with teams to troubleshoot and optimize pipelines

Desired Skills:
- Python (good to have)
- Bash scripting (good to have)
- Docker (must)
- HPC operations and LSF (good to have)
- Experience with DDN Lustre, Mellanox fabric (good to have)
- Google Cloud Platform (good to have)
- Knowledge of Git (must)

Seniority level: Mid-Senior level
Employment type: Contract
Job function: Engineering and Information Technology
Industries: IT Services and IT Consulting

