BUG, MAINT: log with negative heatmap bin values (from Slack)
tylerjereddy opened this issue · 6 comments
This log was provided by Nafise Moti on Slack, with debug prints in the darshan/datatypes/heatmap.py
code suggesting we get negative heatmap bin values with print(rec['write_bins'])
.
moti_nek5000.tar.gz
I asked to see if they could regenerate the log with the latest darshan-runtime
, in case it is simply an issue with the older runtime. If it is reproducible with the newer runtime, then it may be interesting to diagnose on the runtime side.
However, the initial log above is probably enough for us to start studying for evidence of i.e., overflow to produce large negative numbers, and to see if the darshan-parser
agrees with PyDarshan bindings perhaps.
It looks like there may be another problem with this log (either something orthogonal, or possibly the root cause of record corruption elsewhere). My first attempt to parse the log manually using a current origin/main build produced the following:
carns-x1-7g ~/w/d/darshan-nafiseh-negative-heatmap-2023-06> ~/working/install/bin/darshan-parser moti_nek5000_id11006669-50626_5-31-893-3074083273138734090_1.darshan |grep HEAT
# HEATMAP module: 172850 bytes (compressed), ver=1
AddressSanitizer:DEADLYSIGNAL
=================================================================
==625539==ERROR: AddressSanitizer: SEGV on unknown address 0x632000050000 (pc 0x7fc67814bda6 bp 0x63200004ffff sp 0x7ffdae52b858 T0)
==625539==The signal is caused by a WRITE memory access.
#0 0x7fc67814bda6 (/lib/x86_64-linux-gnu/libz.so.1+0x3da6) (BuildId: e25bfc3865cecab5886349610fb620ce21758d11)
#1 0x7fc6781577de in inflate (/lib/x86_64-linux-gnu/libz.so.1+0xf7de) (BuildId: e25bfc3865cecab5886349610fb620ce21758d11)
#2 0x7fc6781a9b2f in darshan_log_libz_read ../../darshan-util/darshan-logutils.c:1681
#3 0x7fc6781a934c in darshan_log_dzread ../../darshan-util/darshan-logutils.c:1567
#4 0x7fc67819ff5b in darshan_log_get_mod ../../darshan-util/darshan-logutils.c:789
#5 0x7fc6781d22b8 in darshan_log_get_lustre_record ../../darshan-util/darshan-lustre-logutils.c:95
#6 0x557a1d9ad4a4 in main ../../darshan-util/darshan-parser.c:395
#7 0x7fc677023a8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
#8 0x7fc677023b48 in __libc_start_main_impl ../csu/libc-start.c:360
#9 0x557a1d9ab604 in _start (/home/carns/working/install/bin/darshan-parser+0x4604) (BuildId: 56758761c6dea4840907f84f13af5d89cc9b1f3e)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/lib/x86_64-linux-gnu/libz.so.1+0x3da6) (BuildId: e25bfc3865cecab5886349610fb620ce21758d11)
==625539==ABORTING
My Darshan build is using gcc 12.2 and address sanitizer. Address sanitizer is triggering a segfault while darshan-parser is extracting Lustre module data, which happens before it gets to the heatmap module.
I can't validate yet, but it's possible that memory corruption is leading to the bogus heatmap data. I'll try hacking up the parser to extract only the heatmap data and see what it looks like.
Ok, that was a red herring. When I hacked up darshan-parser to skip straight to the heatmap module I definitely see negative values:
carns-x1-7g ~/w/d/darshan-nafiseh-negative-heatmap-2023-06> ~/working/install/bin/darshan-parser moti_nek5000_id11006669-50626_5-31-893-3074083273138734090_1.darshan |grep HEATMAP |grep -v "^#" | cut -f 5 |sort -n |uniq -c |head -n 20
8 -9223372036854775808
4 -9223372036854774784
7 -9223372036854773760
1 -9223372036854772736
3 -9223372036854771712
3 -9223372036854769664
1 -9223372036854767616
1 -9223372036854765568
1 -9223372036854764544
2 -9223372036854762496
1 -9223372036854761472
1 -9223372036854758400
527058 0
2049 0.400000
1 11
1 26
1 29
1 32
1 74
1 130
It looks like a little over 30 bins have negative values.
I tried this little test program to see what the max value of an int64 counter is:
carns-x1-7g ~/tmp> cat foo.c e: darshan-dev
#include <stdint.h>
#include <limits.h>
#include <stdio.h>
int main(int argc, char **argv) {
printf("int64 max: %ld\n", INT64_MAX);
}
carns-x1-7g ~/tmp> ./foo e: darshan-dev
int64 max: 9223372036854775807
The negative values here look like they are very close to the most negative values possible on a signed 64 bit integer. It's impossible for the application to have accessed enough data to simply wrap around though. This has to be a bug in Darshan or some sort of corner case in the function arguments we are tracking.
Another piece of the puzzle: all of these faulty values are in the stdio heatmap:
> ~/working/install/bin/darshan-parser moti_nek5000_id11006669-50626_5-31-893-3074083273138734090_1.darshan |grep HEATMAP |grep -v "^#" | grep "-"
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_1 -9223372036854775808 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_110 -9223372036854775808 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_112 -9223372036854773760 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_113 -9223372036854769664 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_114 -9223372036854765568 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_115 -9223372036854758400 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_116 -9223372036854764544 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_117 -9223372036854774784 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_118 -9223372036854771712 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_119 -9223372036854761472 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_120 -9223372036854773760 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_121 -9223372036854769664 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_122 -9223372036854771712 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_123 -9223372036854762496 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_124 -9223372036854767616 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_125 -9223372036854774784 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_126 -9223372036854773760 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_127 -9223372036854772736 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_128 -9223372036854775808 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_129 -9223372036854769664 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_130 -9223372036854762496 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_131 -9223372036854774784 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_132 -9223372036854773760 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_133 -9223372036854775808 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_134 -9223372036854775808 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_135 -9223372036854775808 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_136 -9223372036854773760 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_137 -9223372036854773760 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_138 -9223372036854775808 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_139 -9223372036854773760 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_140 -9223372036854774784 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_141 -9223372036854771712 heatmap:STDIO UNKNOWN UNKNOWN
HEATMAP 0 3989511027826779520 HEATMAP_WRITE_BIN_142 -9223372036854775808 heatmap:STDIO UNKNOWN UNKNOWN
I think we might need a runtime reproducer to track this down. I'd like to see what the arguments look like on the way in to the heatmap_update()
calls. Feels like a bad cast somewhere.
Nafiseh also provided a log from a different run that does not exhibit this problem (this is again with the parser hacked to only show the heatmap data):
carns-x1-7g ~/w/d/darshan-nafiseh-negative-heatmap-2023-06> ~/working/install/bin/darshan-parser moti_nek5000_id11407458-51659_8-29-42872-3372088521547566391_5.darshan |grep -v "^#" | grep "-"
carns-x1-7g ~/w/d/darshan-nafiseh-negative-heatmap-2023-06 [0|0|1]>
So this is possibly an application problem, but it would still be good to know if there was something we could have done to prevent bogus data from getting to the log or if it was some sort of uncontrollable problem (like a memory corruption).