PyLatencyMap

First release: September 2013, latest updates: May 2016.

Relevant blog articles and references:

PyLatencyMap is a tool, with an accompanying set of scripts, for collecting and displaying latency heat maps on the command line. It was originally developed as an aid for performance troubleshooting and benchmarking of storage for Oracle databases, and it can be used or extended to display heat maps from any generic source of latency data. PyLatencyMap is written in Python and has been tested with Python 2.6 and 2.7. The tool works from the command line and requires a terminal with support for ANSI escape codes, which are used for color output.

The architecture is modular: data is generated by a collection script, piped through a filter if needed, and finally sent to the visualization engine. Example: data_source | [optional filter] | python LatencyMap.py

Examples and getting started:

PyLatencyMap comes with scripts and examples for SystemTap, BPF/bcc, DTrace, the Oracle wait event interface, NetApp, and custom data collection scripts. Feel free to experiment with the example scripts and extend them. Some pointers:

For investigating block I/O latency, start with Example9_SystemTap_blockIO_req.sh: stap -v SystemTap/blockio_rq_issue_pylatencymap.stp 3 | python SystemTap/systemtap_connector.py | python LatencyMap.py
For investigating Oracle "db file sequential read" events, start with Example1_oracle_random_read.sh: sqlplus -S / as sysdba @Event_histograms_oracle/ora_latency.sql "db file sequential read" 3 | python LatencyMap.py

Frequency-Intensity Latency Heat Maps:

The representation of latency histograms over time is a 3D visualization problem. As shown by Brendan Gregg and co-workers, heat maps are an excellent way to solve it. PyLatencyMap displays latency data using 2 heat maps:

The frequency heat map visualizes the number of operations (wait events).
The intensity/importance heat map displays where the time is spent.

Frequency heat maps provide information on how many operations come from each latency source, for example what fraction of IOPS comes from controller cache (or SSD) and what fraction comes from spindles. Intensity/importance heat maps highlight the total weight of each latency bucket. This makes it possible to identify where most of the I/O time is spent: I/O served by storage cache, I/O served by spindles, or I/O outliers (for example, 1 long wait of 1 sec weighs as much as 1000 short waits of 1 ms). For the purposes of this tool, data for the intensity heat map is estimated from the same histogram data used for the frequency heat map (ideally it should come from separate counters for additional precision).
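As an illustration of how intensity can be estimated from frequency data, here is a minimal sketch in Python. It is not taken from LatencyMap.py. It assumes that a bucket labelled b counts waits with latency between b/2 and b, and therefore uses 0.75 * b as the representative latency of each wait. The function name and the sample histogram are hypothetical.

```python
def estimate_intensity(histogram):
    """Estimate the time spent per bucket from a frequency histogram.

    histogram maps a power-of-2 bucket value to the number of waits in it.
    Assumption: a bucket labelled b holds waits in (b/2, b], so 0.75 * b
    is taken as the representative latency of one wait in that bucket.
    """
    return {bucket: count * 0.75 * bucket for bucket, count in histogram.items()}


# Hypothetical histogram for one time interval, latencies in milliseconds.
frequency = {1: 1000, 8: 200, 1024: 1}
intensity = estimate_intensity(frequency)

for bucket in sorted(frequency):
    print(bucket, frequency[bucket], intensity[bucket])
```

With these sample numbers, the single wait in the 1024 ms bucket gets roughly the same weight as the 1000 waits in the 1 ms bucket, which is the effect the intensity heat map is meant to expose.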

PyLatencyMap integrates a variety of latency data sources into a visualization engine. One of the underlying ideas is to keep the tool structure simple and very close to command-line administration style. Three types of scripts are available and they work together in a chain: with a simple pipe operator, the output of one step becomes the input of the next. Data source scripts extract data in latency histogram format from Oracle, SystemTap, BPF/bcc, DTrace, trace files, etc. Data connector scripts may be needed to convert the data source output into the custom format used by the visualization engine. Finally, the visualization engine LatencyMap.py produces the Frequency-Intensity heat maps. ANSI escape codes are the simple solution used to print color in a text environment.
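To show what printing color with ANSI escape codes looks like in practice, here is a small sketch in Python. It is only an illustration and does not reproduce the palette or the logic of LatencyMap.py: the palette values, the function names and the linear scaling are assumptions, and the 256-color escape sequence used here needs a terminal that supports it.

```python
import sys

RESET = "\x1b[0m"
# Hypothetical palette: ANSI 256-color codes going from dark blue to red.
PALETTE = [17, 19, 27, 39, 226, 208, 196]


def color_cell(value, max_value):
    """Return one heat map cell: a space on a background color scaled to value."""
    if value <= 0 or max_value <= 0:
        return " "  # empty cell, no color
    index = min(len(PALETTE) - 1, int(value * len(PALETTE) / max_value))
    # ESC[48;5;<n>m selects background color n from the 256-color table.
    return "\x1b[48;5;%dm %s" % (PALETTE[index], RESET)


def render_row(values, max_value):
    """Render one row of the heat map, one colored cell per value."""
    return "".join(color_cell(v, max_value) for v in values)


if __name__ == "__main__":
    sys.stdout.write(render_row([0, 5, 20, 60, 100], 100) + "\n")
```

Each cell is reset right after it is drawn, so the colors do not leak into the text that follows on the same line.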

Currently available data sources and connectors: SystemTap, Oracle wait event interface histograms, BPF/bcc, NetApp C-mode performance counters, DTrace, Oracle AWR event histogram data, Oracle 10046 trace data. For each of them example scripts are provided. More data sources may be added in future versions (contributions are welcome BTW).

Command line options:

--num_records=arg      : number of time intervals displayed, default is 90
--min_bucket=arg       : the lowest latency bucket value is 2^arg, default is -1 (autotune)
--max_bucket=arg       : the highest latency bucket value is 2^arg, default is 64 (autotune)
--frequency_maxval=arg : default is -1 (autotune)
--intensity_maxval=arg : default is -1 (autotune)
--screen_delay=arg     : used to add time delay when replaying trace files, default is 0.1 (sec)
--debug_level=arg      : debug level, default is 0, 5 is max debug level

Notes on the custom Input Data Format:

Data is split in records, each delimited by a begin-record marker and an end-record marker
Each record contains a latency histogram and additional metadata
Timestamp data is in this format: timestamp, microsec,,
Time units used for latency data can be specified as: latencyunit, {millisec|microsec|nanosec}
Data label: label,
Latency data is in this format: bucket,value where bucket must be a power of 2 (1,2,4,..2^N)
See also an example in SampleData/example_latency_data.txt
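For a custom data collection script, the histogram part of a record can be produced along these lines. This is a sketch under stated assumptions, not code from the repository: it assumes that each bucket counts the samples whose latency is at most the bucket value and above the previous bucket, the function names are hypothetical, and the exact spacing of the latencyunit line should be checked against the sample file. The record delimiters, timestamp and label lines are left out on purpose and should be taken from SampleData/example_latency_data.txt.

```python
from collections import Counter


def bucket_for(latency):
    """Return the smallest power of 2 that is >= latency (minimum bucket is 1)."""
    bucket = 1
    while bucket < latency:
        bucket *= 2
    return bucket


def histogram_lines(latencies, unit="millisec"):
    """Format raw latency samples as a latencyunit line plus bucket,value lines."""
    counts = Counter(bucket_for(x) for x in latencies)
    lines = ["latencyunit,%s" % unit]
    top = max(counts) if counts else 1
    bucket = 1
    while bucket <= top:
        # Emit every power of 2 up to the largest bucket, including empty ones.
        lines.append("%d,%d" % (bucket, counts.get(bucket, 0)))
        bucket *= 2
    return lines


# Hypothetical latency samples in milliseconds.
samples = [0.4, 0.9, 3, 3.5, 7, 12]
print("\n".join(histogram_lines(samples)))
```

Emitting the empty buckets as well keeps the bucket list contiguous, which makes the output easier to compare across time intervals.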

References and acknowledgements:

The main source of inspiration for this work is Brendan Gregg's ACM Queue article "Visualizing System Latency" of May 2010 and the subsequent blog articles and tools described at http://www.brendangregg.com/HeatMaps/latency.html. A tool for displaying storage latency using ANSI coloring has also been introduced in sysdig, see https://sysdig.com/aws-storage-latency-sysdig-spectrogram/
