#6 EXAMON – Holistic supercomputer facility management
Examon is a distributed and scalable monitoring infrastructure that combines delocalized monitoring
agents (software and hardware) with a distributed database.
The testbed focuses on data-driven prescriptive maintenance and optimization of large computing
facilities. Computing centres today comprise thousands of computing nodes featuring parallel and
heterogeneous computing elements. These computing nodes are aggregated in racks, and racks in
computing rooms. In addition to the computing racks, computing rooms host storage racks and cooling
equipment. To reduce their operational/maintenance costs, datacentres should embed holistic live
monitoring support: monitoring data come from multiple heterogeneous sources, i.e., on-chip and on-
board sensors integrated on the compute nodes, as well as sensors at the node, rack, and room level.
The Examon infrastructure will be enriched with machine learning to construct power and performance
models for heterogeneous machines (CPU + accelerator nodes). In addition, Examon will be
integrated with data-driven anomaly detection to enable proactive maintenance of the involved HPC
nodes and sub-systems.
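As an illustration of what power model construction from monitoring data can look like, the following minimal sketch fits a linear model of node power against CPU and accelerator utilisation using ordinary least squares. Everything in it is an assumption made for illustration: the function names, the two utilisation features, and the linear model form are hypothetical and are not taken from Examon, which may use different features and learning methods.

```python
from typing import List, Sequence


def fit_linear_power_model(
    samples: Sequence[Sequence[float]], power_w: Sequence[float]
) -> List[float]:
    """Least-squares fit of power ~ b0 + b1*cpu_util + b2*acc_util.

    `samples` holds (cpu_util, acc_util) pairs in [0, 1]; `power_w` holds the
    measured node power in watts. Returns the coefficients [b0, b1, b2].
    """
    # Design matrix rows with a leading 1.0 for the intercept term.
    rows = [[1.0, *s] for s in samples]
    n = len(rows[0])
    # Normal equations (X^T X) b = X^T y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, power_w))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("samples do not determine a unique model")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict_power(coeffs: Sequence[float], cpu_util: float, acc_util: float) -> float:
    """Evaluate the fitted linear model for one utilisation pair."""
    return coeffs[0] + coeffs[1] * cpu_util + coeffs[2] * acc_util
```

In practice the training samples would come from the monitoring database rather than from hand-written lists, and a linear model is only a baseline against which richer learned models can be compared.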
Objectives
Data collection for predictive power and performance model construction for heterogeneous machines (CPU + Accelerator nodes)
Integration with data-driven anomaly detection, based on trained models, to identify “stragglers” (unusually slow nodes) and thermal hotspots, enabling proactive maintenance of the involved HPC nodes and sub-systems
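
The second objective can be illustrated with a deliberately simple statistical stand-in for the trained models mentioned above: a modified z-score based on the median and the median absolute deviation (MAD), which flags nodes whose metric is unusually high relative to the rest of the fleet. The function name, the node names, the sample values, and the 3.5 threshold are all assumptions chosen for illustration and are not part of Examon.

```python
from statistics import median
from typing import Dict, List


def unusually_high(values: Dict[str, float], threshold: float = 3.5) -> List[str]:
    """Return the keys whose value lies far above the fleet median.

    Uses the modified z-score 0.6745 * (v - median) / MAD, which is less
    sensitive to the outliers themselves than a mean/standard-deviation score.
    """
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        # All (or most) values are identical: nothing can be ranked.
        return []
    return sorted(
        node for node, v in values.items() if 0.6745 * (v - med) / mad > threshold
    )


# Hypothetical per-node time (seconds) for the same iteration of one job.
step_time_s = {
    "node01": 10.1, "node02": 9.9, "node03": 10.0, "node04": 10.2, "node05": 14.8,
}
# Hypothetical per-node CPU temperature (degrees Celsius).
cpu_temp_c = {
    "node01": 61.0, "node02": 63.0, "node03": 62.0,
    "node04": 60.0, "node05": 62.0, "node06": 85.0,
}

slow_nodes = unusually_high(step_time_s)   # candidate slow nodes
hot_nodes = unusually_high(cpu_temp_c)     # candidate thermal hotspots
```

The same function serves both cases because slow nodes and hotspots are both one-sided outliers; a trained model would replace the fleet median with a per-node prediction and score the residual instead.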