Performance Analysis Framework for Scientific Workflows for online Data Analysis and Performance Visualization

Chimbuko

Introduction

The Chimbuko framework captures, analyzes and visualizes performance metrics for complex scientific workflows and relates these metrics to the context of their execution (provenance) on extreme-scale machines. The purpose of Chimbuko is to enable empirical performance studies of an application or workflow, whether during development or across different computational environments.

Chimbuko enables the comparison of different runs at both coarse and fine metric granularity: it captures and displays aggregate statistics such as function profiles and counter averages, and it maintains detailed trace information. Because trace data volume grows quickly for applications running on multi-node machines, the core of Chimbuko is an in-situ data reduction component that captures trace data from a running application instance (e.g. an MPI rank) and applies machine learning to identify anomalous function executions. By retaining detailed data primarily for these performance anomalies, Chimbuko achieves a significant reduction in data volume while preserving detailed information about the events that impact application performance.
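The data reduction idea can be illustrated with a minimal sketch. This is not Chimbuko's implementation: it assumes a simple rule in which an execution is kept only if its runtime lies more than a chosen number of standard deviations from the mean runtime of the same function. The function name `filter_anomalies`, the `(function, runtime)` event layout and the `threshold` parameter are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def filter_anomalies(events, threshold=3.0):
    """Keep only executions whose runtime deviates from the per-function
    mean by more than `threshold` standard deviations (illustrative rule)."""
    # Group runtimes by function name.
    by_func = defaultdict(list)
    for func, runtime in events:
        by_func[func].append(runtime)

    # Per-function mean and population standard deviation.
    stats = {f: (mean(r), pstdev(r)) for f, r in by_func.items()}

    kept = []
    for func, runtime in events:
        mu, sigma = stats[func]
        if sigma > 0 and abs(runtime - mu) > threshold * sigma:
            kept.append((func, runtime))
    return kept


# Hypothetical trace: 99 ordinary executions and one slow one.
events = [("solve", 10.0)] * 99 + [("solve", 500.0)]
kept = filter_anomalies(events)
print(f"kept {len(kept)} of {len(events)} events: {kept}")
```

In this toy trace only the single slow execution survives the filter, which is the sense in which keeping anomalies alone reduces the data volume.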

In addition to supporting offline analysis of the data collected over a run, Chimbuko provides an online visualization tool for monitoring aggregated statistics and individual anomalous executions in real time.

The following figure shows the basic layout of the Chimbuko framework.

[Figure: Chimbuko Basic Layout]

  • The ADIOS framework orchestrates the workflow and provides data streaming.
  • The TAU tool provides performance metrics for instrumented components 1 and 2. The tool extracts provenance metadata and trace data.
  • Trace data is dynamically analyzed to detect anomalies by the Online AD modules, and aggregate statistics are maintained on the parameter server.
  • Detailed provenance information regarding the detected anomalies is stored in the provenance database, an UnQLite JSON document-store remote database provided by the Mochi Sonata framework.
  • The visualization module allows for interaction with Chimbuko in real time.
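One way to picture how aggregate statistics from many anomaly detection instances could be combined in a central place is the standard pairwise merge of running mean and variance. The sketch below is an assumption made for illustration, not the parameter server's actual code; the class `RunStats` and its methods are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Running count, mean and sum of squared deviations (m2)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def push(self, x):
        # Welford's online update for a single new sample.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Pairwise combination of two partial summaries.
        if other.count == 0:
            return RunStats(self.count, self.mean, self.m2)
        if self.count == 0:
            return RunStats(other.count, other.mean, other.m2)
        n = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return RunStats(n, mean, m2)

    @property
    def variance(self):
        # Population variance of all samples seen so far.
        return self.m2 / self.count if self.count else 0.0


# Two hypothetical ranks each summarize their own runtimes locally.
rank0, rank1 = RunStats(), RunStats()
for x in [1.0, 2.0, 3.0]:
    rank0.push(x)
for x in [4.0, 5.0, 6.0, 7.0]:
    rank1.push(x)

merged = rank0.merge(rank1)
print(merged.count, merged.mean, merged.variance)
```

Because only three numbers per function travel from each instance, global statistics can be maintained without shipping the raw trace.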

For more information about the design and working philosophy of Chimbuko, please see the documents directory.

Documentation

Detailed documentation on the API, installation and usage of the Chimbuko "PerformanceAnalysis" backend can be found here, and documentation on the visualization module can be found here.

Releases

The current v6.5 release includes:

  • Offline analysis command-line tooling that now supports interactive parsing and summarizing of global provenance data
  • Support for the Cray CXI provider used by HPE Slingshot 11 networks on systems such as Frontier
  • Improved ease of use of the InfiniBand verbs provider
  • Significant performance optimizations and robustness/veracity improvements to the HBOS algorithm implementation
  • Experimental support for launching Chimbuko's services and online anomaly detection components through a single script
  • Various fixes and code improvements
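For readers unfamiliar with HBOS (histogram-based outlier score), the following is a minimal one-dimensional sketch of the general technique with fixed-width bins. It is not Chimbuko's implementation, whose details are not described here; the function name `hbos_scores` and the `n_bins` parameter are hypothetical.

```python
import math


def hbos_scores(values, n_bins=10):
    """Score each value by how sparse its histogram bin is.

    Bin heights are normalized so the fullest bin has height 1, and the
    score is log(1 / height): 0 for the most common bin, larger for rarer bins.
    """
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / n_bins or 1.0

    def bin_of(v):
        # Clamp so the maximum value falls in the last bin.
        return min(int((v - lo) / width), n_bins - 1)

    counts = [0] * n_bins
    for v in values:
        counts[bin_of(v)] += 1

    peak = max(counts)
    return [math.log(peak / counts[bin_of(v)]) for v in values]


# Hypothetical runtimes: 95 typical executions and 5 slow ones.
runtimes = [10.0] * 95 + [100.0] * 5
scores = hbos_scores(runtimes)
print(scores[0], scores[-1])
```

A threshold on the score then separates ordinary executions from candidate anomalies; how that threshold is chosen is left open in this sketch.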

This library provides C/C++ APIs to process TAU performance profiles and traces.

This is a visualization framework for online performance analysis. It focuses on visualizing anomalous behavior in a High Performance Computing application in real time, so that patterns of anomalies that users might otherwise miss can be detected through online visual analytics.

Citations

For citing Chimbuko, please use:

C. Kelly et al., “Chimbuko: A Workflow-Level Scalable Performance Trace Analysis Tool,” in ICPS Proceedings, in ISAV’20. online: Association for Computing Machinery, Nov. 2020, pp. 15–19. doi: 10.1145/3426462.3426465.