Monitor High Performance Computing Systems Health with Psychart

This is a Guest Post

OpenHVAC periodically asks professionals across the community to generate content that we feel is relevant to our community members. OpenHVAC is not associated with, and does not receive any compensation for, any of the products and services listed in this article.

Nicolas Ventura (pronounced “Nikola”) is a critical facilities engineer who has a vested interest in open-source software. He obtained his master’s degree in mechanical and aerospace engineering in 2022 and his PE license in control systems engineering in 2024. Now at NERSC, Nicolas supports the deployment of data center infrastructure management and develops open-source visualization and analysis tools for infrastructure data, such as Psychart.
– Nicolas Ventura, NERSC

The National Energy Research Scientific Computing Center (NERSC) is a modern data center, home to one of the most powerful high-performance computing (HPC) systems in the world for scientific research in genetics, physics, geology, and more. To ensure optimal performance by NERSC systems in support of science, the infrastructure team at NERSC has to closely track facility conditions.

Operated by Lawrence Berkeley National Laboratory (Berkeley Lab) in Berkeley, California, NERSC employs an exhaustive operational analytics model to monitor data from around the facility, including HPC telemetry, power consumption, and environmental information. That material is used to diagnose problems within the system, optimize efforts to reduce water and power consumption, and monitor overall facility health.

NERSC uses a collection of programs, databases, power meters, and sensors to perform near-real-time system analytics. This analysis is critical because any deviation outside certain air condition parameters can degrade our HPC system. For example, if the air is:

  • Too humid, it could lead to corrosion, tape media errors, anodic failures, and more.
  • Too cold, it could lead to an overworked mechanical system and less effective power usage.
  • Too dry, it could lead to electrostatic discharge.
  • Too hot, it could lead to overheating and reduce the overall system lifetime.
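
The four failure modes above can be sketched as a simple range check. This is a minimal illustration, not NERSC's monitoring logic: the `Limits` type, the `checkAir` function, and the numbers in `exampleLimits` are all hypothetical placeholders and are not taken from the ASHRAE guidelines.

```typescript
// Hypothetical operating limits; the values used below are placeholders,
// not NERSC's or ASHRAE's actual thresholds.
interface Limits {
  minDryBulbC: number; // lowest acceptable dry bulb temperature, °C
  maxDryBulbC: number; // highest acceptable dry bulb temperature, °C
  minRelHum: number;   // lowest acceptable relative humidity, 0..1
  maxRelHum: number;   // highest acceptable relative humidity, 0..1
}

type Issue = "too hot" | "too cold" | "too humid" | "too dry";

// Return every limit that the measured air condition violates.
function checkAir(dryBulbC: number, relHum: number, limits: Limits): Issue[] {
  const issues: Issue[] = [];
  if (dryBulbC > limits.maxDryBulbC) issues.push("too hot");
  if (dryBulbC < limits.minDryBulbC) issues.push("too cold");
  if (relHum > limits.maxRelHum) issues.push("too humid");
  if (relHum < limits.minRelHum) issues.push("too dry");
  return issues;
}

// Placeholder limits for illustration only.
const exampleLimits: Limits = {
  minDryBulbC: 18,
  maxDryBulbC: 27,
  minRelHum: 0.2,
  maxRelHum: 0.6,
};

console.log(checkAir(30, 0.7, exampleLimits)); // [ 'too hot', 'too humid' ]
```

A real envelope is usually bounded by dew point as well as relative humidity, so a rectangular check like this one is only a rough first approximation.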

Because a significant portion of NERSC’s systems requires air cooling, it’s important that NERSC staff have access to accurate real-time monitoring of air conditions both outside and inside the data center and HPC systems. To make that happen, NERSC staff developed a psychrometric chart to provide detailed at-a-glance information on conditions. Here’s how it happened and how it’s serving NERSC today.

Environmental Data Collection

The infrastructure and operations groups at NERSC gather environmental data such as temperatures, humidities, flow rates, pump speeds, and damper positions from our mechanical and control systems. These metrics come from our building management system and are streamed into a database that organizes data by timestamp to be visualized with software such as Grafana.

Accurate real-time monitoring and visualization of environmental conditions is important to ensure that NERSC system health, operational efficiency, and energy efficiency are maintained. If temperatures go too high, for example, HPC equipment could run at lower efficiencies at best, or be damaged at worst. Representing this data through visualizations also makes it much easier for NERSC staff to monitor how environmental conditions respond to control program changes. We selected Grafana, an open-source platform, for our dashboards.

Psychrometrics

ASHRAE (formerly the American Society of Heating, Refrigerating and Air-Conditioning Engineers) has defined environmental parameters that have become a standard for data centers. These parameters are called psychrometric properties: dry bulb temperature, wet bulb temperature, dew point temperature, and relative humidity. They are often visualized using a psychrometric chart. Their combination is called a psychrometric state, and staying within the allowable range assigned to them can maintain efficiency and maximize the lifetime of computer equipment.

Only two psychrometric properties are required to “fix the state,” meaning all other state variables can be calculated just by knowing any two of them. Therefore, if thermal/humidity sensors go out, this type of visualization will still work, as long as there are at least two independent sensors online. Building a functional psychrometric chart has been a long-term goal for the operations groups at NERSC.
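
As a rough illustration of fixing the state, the sketch below derives dew point and humidity ratio from dry bulb temperature and relative humidity. It assumes the Magnus approximation for saturation vapor pressure and standard sea-level pressure; it is not the formulation Psychart or PsychroLib uses, and the names `PsychroState` and `fixState` are hypothetical.

```typescript
// Magnus approximation coefficients over liquid water.
// Assumption: chosen for brevity; this is not PsychroLib's formulation.
const MAGNUS_A = 17.62;
const MAGNUS_B = 243.12; // °C
const MAGNUS_C = 6.112;  // hPa

interface PsychroState {
  dryBulbC: number;  // dry bulb temperature, °C
  relHum: number;    // relative humidity, 0..1 (must be > 0)
  dewPointC: number; // dew point temperature, °C
  humRatio: number;  // kg of water vapor per kg of dry air
}

// Saturation vapor pressure of water in hPa at the given temperature.
function saturationPressureHpa(tempC: number): number {
  return MAGNUS_C * Math.exp((MAGNUS_A * tempC) / (MAGNUS_B + tempC));
}

// Fix the state from two known properties: dry bulb and relative humidity.
function fixState(dryBulbC: number, relHum: number, pressureHpa = 1013.25): PsychroState {
  const vaporPressure = relHum * saturationPressureHpa(dryBulbC);
  // Invert the Magnus formula to find the temperature at which
  // the current vapor pressure would be saturated (the dew point).
  const gamma = Math.log(relHum) + (MAGNUS_A * dryBulbC) / (MAGNUS_B + dryBulbC);
  const dewPointC = (MAGNUS_B * gamma) / (MAGNUS_A - gamma);
  // 0.621945 is the ratio of the molar masses of water and dry air.
  const humRatio = (0.621945 * vaporPressure) / (pressureHpa - vaporPressure);
  return { dryBulbC, relHum, dewPointC, humRatio };
}

const state = fixState(25, 0.5);
console.log(state.dewPointC.toFixed(1), state.humRatio.toFixed(4));
```

The same idea works for other pairs, such as dry bulb and dew point, by inverting the corresponding relation instead.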

Developing Psychart

One primary goal of the psychrometric chart development process was that the chart would integrate seamlessly with NERSC’s existing monitoring system and be interactive and responsive to real-time updates. The initial vision was to plot a single time series on a blank psychrometric chart with some way to indicate to the viewer which direction it was trending.

As is often the case, getting started was the difficult part. The first iteration was a standalone version of Psychart that still exists and is maintained today. This version does not communicate with Grafana or any sort of database, which allowed the entire development focus to stay on the graphical user interface of the psychrometric chart, axis labels, plotted data, and data labels without other problems getting in the way.

Everything in the first iteration was written in JavaScript for browser compatibility with scalable vector graphics to render everything visible to the end user. Using a color gradient became the ideal method for expressing the history of data, with more saturated colors indicating the most recent data points.
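
A color gradient of this kind can be sketched in a few lines. The function below is a hypothetical illustration, not Psychart's actual color code: it assumes points are ordered oldest to newest and maps recency to the saturation of an HSL color, with the `pointColor` name, the default hue, and the 20% to 100% saturation range chosen arbitrarily.

```typescript
// Map a point's position in the series to an HSL color string.
// index 0 is the oldest point; index count - 1 is the newest.
// Assumption: hue and saturation range are arbitrary illustrative choices.
function pointColor(index: number, count: number, hue = 210): string {
  // Recency runs from 0 (oldest) to 1 (newest); a single point counts as newest.
  const recency = count > 1 ? index / (count - 1) : 1;
  // Older points are washed out, newer points are fully saturated.
  const saturation = Math.round(20 + 80 * recency);
  return `hsl(${hue}, ${saturation}%, 50%)`;
}

console.log(pointColor(0, 5)); // hsl(210, 20%, 50%)
console.log(pointColor(4, 5)); // hsl(210, 100%, 50%)
```

The returned string can be used directly as the `fill` attribute of an SVG element.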

Additionally, the software library called Psychrolib came packaged with mathematical functions for fixing the state using two psychrometric properties. The most interesting problem to solve during this phase of the project was translating a psychrometric state into an (x,y) coordinate pair for plotting data. Solving it also enabled the generation of a blank psychrometric chart on the fly, which allowed for additional customization options to offset the bounds of the chart. Last, the ASHRAE thermal guidelines for air-cooled data centers were embedded into Psychart, so that the user could toggle them on and off.
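
One way to picture the state-to-coordinate translation is a linear mapping onto the drawing area. The article does not describe Psychart's actual projection, so the sketch below assumes the conventional chart layout (dry bulb temperature on the horizontal axis, humidity ratio on the vertical axis) and an SVG-style origin at the top left. The names `ChartBounds` and `toPixel` are hypothetical.

```typescript
// Chart extents and drawing size. All values are supplied by the caller;
// the field names here are illustrative assumptions.
interface ChartBounds {
  minDryBulbC: number; // dry bulb temperature at the left edge, °C
  maxDryBulbC: number; // dry bulb temperature at the right edge, °C
  maxHumRatio: number; // humidity ratio at the top edge (bottom edge is 0)
  widthPx: number;     // drawing width in pixels
  heightPx: number;    // drawing height in pixels
}

// Convert a psychrometric state (dry bulb, humidity ratio) to pixel coordinates.
function toPixel(dryBulbC: number, humRatio: number, b: ChartBounds): { x: number; y: number } {
  const x = ((dryBulbC - b.minDryBulbC) / (b.maxDryBulbC - b.minDryBulbC)) * b.widthPx;
  // SVG y grows downward, so a higher humidity ratio means a smaller y.
  const y = b.heightPx - (humRatio / b.maxHumRatio) * b.heightPx;
  return { x, y };
}

const bounds: ChartBounds = {
  minDryBulbC: 0,
  maxDryBulbC: 50,
  maxHumRatio: 0.03,
  widthPx: 500,
  heightPx: 300,
};

console.log(toPixel(25, 0.015, bounds));
```

Because the chart bounds are parameters, the same function can draw constant relative humidity curves for a blank chart by sampling states along each curve and connecting the resulting points.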

When it was time to integrate Psychart into Grafana, we imported code into the starter panel plugin, which came pre-packaged with the latest version of the Grafana software development kit (SDK) at the time. The SDK contains a library full of functions for accessing the Grafana application programming interface, including database values. Most of the issues we encountered came up during this phase of the project, including Psychart not being responsive to user input and the need to support different data frame formats. We found the Grafana community and community forums very helpful during this phase of the project.

Once Psychart had passed all its quality control tests, it was queued for publishing as a community plugin. Some communication was required between Berkeley Lab and Grafana because publishing would release the source code to the public, making it an open-source project, but in June of 2022, it was copyrighted by the Regents of the University of California and released under a modified Berkeley Software Distribution license.

Usage at NERSC

Over the nearly two years of its lifetime, Psychart has been a huge benefit to NERSC and has seen many improvements. We installed it on NERSC’s Grafana instance and it immediately and seamlessly provided advantages in visualization in both expected and unexpected ways, including a psychrometric design calculator we built and integrated directly in Grafana!

Psychart was first used to monitor air handler supply conditions, that is, the conditions of the air entering the data center. This data is collected from the transmitters in our control system and allows us to interpret the internal mechanisms working inside the air handler, for example, sensible cooling versus evaporation.

It also made it easier to determine system health at a glance using the ASHRAE envelopes.

Psychrometric chart highlighted to depict the ideal air conditions for high performance computing systems. To the right is too hot, up high and along the edge is too humid, to the bottom left is too cold, and the bottom is too dry.

The live dashboard below shows a snapshot of the past 24 hours of data from an air handler’s supply air sensors.

Psychart visualization

Psychart can also visualize outdoor weather conditions using outdoor sensor data. The graph below shows the past year of psychrometric states using a different color scheme.

Additionally, we are able to visualize compute environmental conditions far more accurately by using cabinet inlet and on-chip measurements from Cray telemetry. This has given us unprecedented insight into HPC environmental conditions and decreased our reaction time when conditions stray outside the tolerated range.

Psychart has also been helpful with troubleshooting. It has helped especially during a period when the subsystems within our cooling distribution units were not functioning as intended. A cooling water valve was stuck open, causing temperatures to rise. We were able to detect those environmental changes in Psychart and proceed with the required maintenance of our cooling subsystems accordingly.

Continued Improvements

Over its lifetime, Psychart has undergone several improvements. Notably, it has been migrated from JavaScript to TypeScript, a newer language that adds static type checking and makes the code base more maintainable. The color schemes are now also better designed for both light and dark themes of Grafana, offering higher contrast for more recent data points. The Grafana SDK packaged with Psychart has been continuously updated to ensure compatibility with the latest version of Grafana.

Two major updates have been made to Psychart since its inception. The first made it usable by industries other than HPC and IT data centers. Six predefined zones newly embedded into Psychart can be toggled on and off; they depend on ambient air speed, metabolic rate (MET, or activity), and clothing level (CLO), based on ASHRAE Standard 55, and are familiarly known as human comfort zones.

These features could be used in other industries such as general office conditioning. Some future work here would be to replace these six fixed regions with user-defined air speed, MET, and CLO parameters that dynamically generate the envelopes for fine-tuning.

At the request of multiple Grafana community users, the latest major update to Psychart introduced the ability to plot multiple data series on the same panel. This has further improved NERSC’s visibility of our air handler operation by plotting mixed, cooled, and supply air on the same plot, alongside outdoor air conditions, helping to ensure that our cooling systems are working properly.

The ability to plot multiple states on one chart also allows collapsing multiple compute cabinet conditions into a single panel. This adds a lot of data clustered around a central point, but a dashboard like this allows the viewer to quickly detect anomalies in our system and whether a particular cabinet requires more cooling, for example.

Conclusion & Acknowledgements

Psychart has proved itself to be a useful addition to NERSC’s extensive monitoring infrastructure and has received several thousand downloads and significant attention in the HVAC community. It is also fully open-source, and community contributions are welcome, especially from the OpenHVAC community! I’d like to acknowledge Berkeley Lab and the Department of Energy for sponsoring this project, as well as Jeffrey Broughton (Berkeley Lab), Norman Bourassa (Berkeley Lab), Benjamin Maxwell (Berkeley Lab), Benjamin Shaw (UC Davis), and Marcus Olsson (Grafana) for all their help and support throughout the development of this project.

Standalone app: https://psychart.nicfv.com/

Grafana download: https://grafana.com/grafana/plugins/ventura-psychrometric-panel/

GitHub repository: https://github.com/nicfv/Psychart
