darshan-hpc/darshan

BUG, MAINT: log with negative heatmap bin values (from Slack)

tylerjereddy opened this issue · 6 comments

This log was provided by Nafise Moti on Slack, with debug prints in the darshan/datatypes/heatmap.py code suggesting we get negative heatmap bin values with print(rec['write_bins']).
moti_nek5000.tar.gz

I asked to see if they could regenerate the log with the latest darshan-runtime, in case it is simply an issue with the older runtime. If it is reproducible with the newer runtime, then it may be interesting to diagnose on the runtime side.

However, the initial log above is probably enough for us to start studying for evidence of i.e., overflow to produce large negative numbers, and to see if the darshan-parser agrees with PyDarshan bindings perhaps.

carns commented

It looks like there may be another problem with this log (either something orthogonal, or possibly the root cause of record corruption elsewhere). My first attempt to parse the log manually using a current origin/main build produced the following:

carns-x1-7g ~/w/d/darshan-nafiseh-negative-heatmap-2023-06> ~/working/install/bin/darshan-parser moti_nek5000_id11006669-50626_5-31-893-3074083273138734090_1.darshan |grep HEAT
# HEATMAP module: 172850 bytes (compressed), ver=1
AddressSanitizer:DEADLYSIGNAL
=================================================================
==625539==ERROR: AddressSanitizer: SEGV on unknown address 0x632000050000 (pc 0x7fc67814bda6 bp 0x63200004ffff sp 0x7ffdae52b858 T0)
==625539==The signal is caused by a WRITE memory access.
    #0 0x7fc67814bda6  (/lib/x86_64-linux-gnu/libz.so.1+0x3da6) (BuildId: e25bfc3865cecab5886349610fb620ce21758d11)
    #1 0x7fc6781577de in inflate (/lib/x86_64-linux-gnu/libz.so.1+0xf7de) (BuildId: e25bfc3865cecab5886349610fb620ce21758d11)
    #2 0x7fc6781a9b2f in darshan_log_libz_read ../../darshan-util/darshan-logutils.c:1681
    #3 0x7fc6781a934c in darshan_log_dzread ../../darshan-util/darshan-logutils.c:1567
    #4 0x7fc67819ff5b in darshan_log_get_mod ../../darshan-util/darshan-logutils.c:789
    #5 0x7fc6781d22b8 in darshan_log_get_lustre_record ../../darshan-util/darshan-lustre-logutils.c:95
    #6 0x557a1d9ad4a4 in main ../../darshan-util/darshan-parser.c:395
    #7 0x7fc677023a8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #8 0x7fc677023b48 in __libc_start_main_impl ../csu/libc-start.c:360
    #9 0x557a1d9ab604 in _start (/home/carns/working/install/bin/darshan-parser+0x4604) (BuildId: 56758761c6dea4840907f84f13af5d89cc9b1f3e)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/lib/x86_64-linux-gnu/libz.so.1+0x3da6) (BuildId: e25bfc3865cecab5886349610fb620ce21758d11) 
==625539==ABORTING

My Darshan build is using gcc 12.2 and address sanitizer. Address sanitizer is triggering a segfault while darshan-parser is extracting Lustre module data, which happens before it gets to the heatmap module.

I can't validate yet, but it's possible that memory corruption is leading to the bogus heatmap data. I'll try hacking up the parser to extract only the heatmap data and see what it looks like.

carns commented

Ok, that was a red herring. When I hacked up darshan-parser to skip straight to the heatmap module I definitely see negative values:

carns-x1-7g ~/w/d/darshan-nafiseh-negative-heatmap-2023-06> ~/working/install/bin/darshan-parser moti_nek5000_id11006669-50626_5-31-893-3074083273138734090_1.darshan |grep HEATMAP |grep -v "^#" | cut -f 5 |sort -n |uniq -c |head -n 20
      8 -9223372036854775808
      4 -9223372036854774784
      7 -9223372036854773760
      1 -9223372036854772736
      3 -9223372036854771712
      3 -9223372036854769664
      1 -9223372036854767616
      1 -9223372036854765568
      1 -9223372036854764544
      2 -9223372036854762496
      1 -9223372036854761472
      1 -9223372036854758400
 527058 0
   2049 0.400000
      1 11
      1 26
      1 29
      1 32
      1 74
      1 130

It looks like a little over 30 bins have negative values.

carns commented

I tried this little test program to see what the max value of an int64 counter is:

carns-x1-7g ~/tmp> cat foo.c                                      e: darshan-dev
#include <stdint.h>
#include <limits.h>
#include <stdio.h>

int main(int argc, char **argv) {
    printf("int64 max: %ld\n", INT64_MAX);
}

carns-x1-7g ~/tmp> ./foo                                          e: darshan-dev
int64 max: 9223372036854775807

The negative values here look like they are very close to the most negative values possible on a signed 64 bit integer. It's impossible for the application to have accessed enough data to simply wrap around though. This has to be a bug in Darshan or some sort of corner case in the function arguments we are tracking.

carns commented

Another piece of the puzzle: all of these faulty values are in the stdio heatmap:

> ~/working/install/bin/darshan-parser moti_nek5000_id11006669-50626_5-31-893-3074083273138734090_1.darshan |grep HEATMAP |grep -v "^#" | grep "-"
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_1	-9223372036854775808	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_110	-9223372036854775808	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_112	-9223372036854773760	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_113	-9223372036854769664	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_114	-9223372036854765568	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_115	-9223372036854758400	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_116	-9223372036854764544	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_117	-9223372036854774784	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_118	-9223372036854771712	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_119	-9223372036854761472	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_120	-9223372036854773760	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_121	-9223372036854769664	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_122	-9223372036854771712	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_123	-9223372036854762496	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_124	-9223372036854767616	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_125	-9223372036854774784	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_126	-9223372036854773760	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_127	-9223372036854772736	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_128	-9223372036854775808	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_129	-9223372036854769664	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_130	-9223372036854762496	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_131	-9223372036854774784	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_132	-9223372036854773760	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_133	-9223372036854775808	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_134	-9223372036854775808	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_135	-9223372036854775808	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_136	-9223372036854773760	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_137	-9223372036854773760	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_138	-9223372036854775808	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_139	-9223372036854773760	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_140	-9223372036854774784	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_141	-9223372036854771712	heatmap:STDIO	UNKNOWN	UNKNOWN
HEATMAP	0	3989511027826779520	HEATMAP_WRITE_BIN_142	-9223372036854775808	heatmap:STDIO	UNKNOWN	UNKNOWN
carns commented

I think we might need a runtime reproducer to track this down. I'd like to see what the arguments look like on the way in to the heatmap_update() calls. Feels like a bad cast somewhere.

carns commented

Nafiseh also provided a log from a different run that does not exhibit this problem (this is again with the parser hacked to only show the heatmap data):

carns-x1-7g ~/w/d/darshan-nafiseh-negative-heatmap-2023-06> ~/working/install/bin/darshan-parser moti_nek5000_id11407458-51659_8-29-42872-3372088521547566391_5.darshan |grep -v "^#" | grep "-"
carns-x1-7g ~/w/d/darshan-nafiseh-negative-heatmap-2023-06 [0|0|1]> 

So this is possibly an application problem, but it would still be good to know if there was something we could have done to prevent bogus data from getting to the log or if it was some sort of uncontrollable problem (like a memory corruption).