NumPy wrapper: sliced reading from file crashes
Closed this issue · 6 comments
Hey there,
I encountered a segmentation fault that occurs when reading a slice from a linear AdiosVar
array when the slice contains a certain index. The index can change (checked different files).
The values can be read on their own but not when accessing more than one element.
The ADIOS 1.13.0 I'm using for post-processing has been built with the same flags as the one I used for the creation of the data - just without mpi. I built the numpy wrapper from source using python setup.py.
I am using blosc with zstd for compression, full parameters: threshold=2048,shuffle=bit,lvl=1,threads=10,compressor=zstd.
$ adios_config -s
DIR=/users/<USER>/lib/adios-1.13.0_nompi
CFLAGS=-I/users/<USER>/lib/adios-1.13.0_nompi/include -D_NOMPI -DZLIB -I/users/<USER>/lib/zlib-1.2.11/include -DBLOSC -I/users/<USER>/lib/blosc-1.12.1/include -I/users/<USER>/lib/blosc-1.12.1/include
LDFLAGS=-L/users/<USER>/lib/adios-1.13.0_nompi/lib -ladios_nompi -L/users/<USER>/lib/zlib-1.2.11/lib64 -L/users/<USER>/lib/blosc-1.12.1/lib -libverbs -lz -lblosc
Available write methods (in XML <method> element or in adios_select_method()):
"POSIX"
Available read methods (constants after #include "adios_read.h"):
ADIOS_READ_METHOD_BP (=0)
Available data transformation methods (in XML transform tags in <var> elements):
"none" : No data transform
"identity" : Identity transform
"zlib" : zlib compression
"zfp" : zfp compression
"blosc" : blosc compression
Available query methods (in adios_query_set_method()):
ADIOS_QUERY_METHOD_MINMAX (=0)
This is what I observe:
In [1]: import numpy as np
In [2]: import adios as ad
In [3]: path = "014_0060gpus2DCopper30nmLeadingEdge1E-3/simOutput/bp/simData_87040.bp"
In [4]: f = ad.File(path)
In [5]: px = f['/data/87040/particles/H_all/momentum/x']
In [6]: px[3740]
Out[6]: 0.04298079386353493
In [7]: px[3741]
Out[7]: 0.08291889727115631
In [8]: px[3742]
Out[8]: -0.13517844676971436
In [9]: px[3740:3741]
Out[9]: array([ 0.04298079], dtype=float32)
In [10]: px[3741:3742]
Out[10]: array([ 0.0829189], dtype=float32)
In [11]: px[3740:3742]
Segmentation fault (core dumped)
A run with gdb shows:
(gdb) run test.py
Starting program: /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/bin/python test.py
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.22-61.3.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x00002aaab0977f54 in adios_transform_blosc_pg_reqgroup_completed () from /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/lib/python3.6/site-packages/adios/adios.cpython-36m-x86_64-linux-gnu.so
Missing separate debuginfos, use: zypper install libibverbs1-debuginfo-1.2.0-17.1.x86_64 libnl3-200-debuginfo-3.2.23-2.21.x86_64
(gdb) backtrace
#0 0x00002aaab0977f54 in adios_transform_blosc_pg_reqgroup_completed ()
from /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/lib/python3.6/site-packages/adios/adios.cpython-36m-x86_64-linux-gnu.so
#1 0x00002aaab09732b7 in adios_transform_pg_reqgroup_completed () from /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/lib/python3.6/site-packages/adios/adios.cpython-36m-x86_64-linux-gnu.so
#2 0x00002aaab0972aba in adios_transform_process_all_reads () from /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/lib/python3.6/site-packages/adios/adios.cpython-36m-x86_64-linux-gnu.so
#3 0x00002aaab0944b24 in common_read_perform_reads () from /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/lib/python3.6/site-packages/adios/adios.cpython-36m-x86_64-linux-gnu.so
#4 0x00002aaab0937157 in adios_perform_reads () from /users/<USER>/lib/anaconda3/envs/analyzePIConGPU/lib/python3.6/site-packages/adios/adios.cpython-36m-x86_64-linux-gnu.so
#5 0x00002aaab08b321d in __pyx_f_5adios_3var_read (__pyx_v_self=0x2aaab0bf9688, __pyx_skip_dispatch=<optimized out>, __pyx_optional_args=<optimized out>) at adios.cpp:24203
#6 0x00002aaab089ac1e in __pyx_pf_5adios_3var_12read (__pyx_v_step_scalar=<optimized out>, __pyx_v_fill=<optimized out>, __pyx_v_nsteps=<optimized out>, __pyx_v_from_steps=<optimized out>,
__pyx_v_scalar=<optimized out>, __pyx_v_count=<optimized out>, __pyx_v_offset=<optimized out>, __pyx_v_self=0x2aaab0bf9688) at adios.cpp:24495
#7 __pyx_pw_5adios_3var_13read (__pyx_v_self=0x2aaab0bf9688, __pyx_args=<optimized out>, __pyx_kwds=0x2aaaabc60288) at adios.cpp:24461
#8 0x0000555555660364 in _PyCFunction_FastCallDict ()
#9 0x000055555568ef30 in _PyCFunction_FastCallKeywords ()
#10 0x00005555556f2ebc in call_function ()
#11 0x00005555557153e7 in _PyEval_EvalFrameDefault ()
#12 0x00005555556ed8d9 in PyEval_EvalCodeEx ()
#13 0x00005555556ee67c in PyEval_EvalCode ()
#14 0x0000555555768ce4 in run_mod ()
#15 0x00005555557690e1 in PyRun_FileExFlags ()
#16 0x00005555557692e4 in PyRun_SimpleFileExFlags ()
#17 0x000055555576cdaf in Py_Main ()
#18 0x00005555556338be in main ()
What could I specifically look into?
The data set that we read is a 1D array written by several process groups.
At the offset of concern, a process group wrote zero entries. This is an issue we encountered (& fixed) before, e.g. with zlib transforms.
The numpy wrapper version is:
import adios as ad
ad.__version__
'1.13.0'
The blockinfo from the file shows
In[9]: px.blockinfo
Out[9]:
[[AdiosBlockinfo (process_id=0, time_index=1, start=(0,), count=(19,)),
AdiosBlockinfo (process_id=1, time_index=1, start=(19,), count=(45,)),
AdiosBlockinfo (process_id=2, time_index=1, start=(64,), count=(1477,)),
AdiosBlockinfo (process_id=3, time_index=1, start=(1541,), count=(61,)),
AdiosBlockinfo (process_id=4, time_index=1, start=(1602,), count=(82,)),
AdiosBlockinfo (process_id=5, time_index=1, start=(1684,), count=(1154,)),
AdiosBlockinfo (process_id=6, time_index=1, start=(2838,), count=(22,)),
AdiosBlockinfo (process_id=7, time_index=1, start=(2860,), count=(46,)),
AdiosBlockinfo (process_id=8, time_index=1, start=(2906,), count=(570,)),
AdiosBlockinfo (process_id=9, time_index=1, start=(3476,), count=(18,)),
AdiosBlockinfo (process_id=10, time_index=1, start=(3494,), count=(18,)),
AdiosBlockinfo (process_id=11, time_index=1, start=(3512,), count=(198,)),
AdiosBlockinfo (process_id=12, time_index=1, start=(3710,), count=(16,)),
AdiosBlockinfo (process_id=13, time_index=1, start=(3726,), count=(4,)),
AdiosBlockinfo (process_id=14, time_index=1, start=(3730,), count=(2,)),
AdiosBlockinfo (process_id=15, time_index=1, start=(3732,), count=(4,)),
AdiosBlockinfo (process_id=16, time_index=1, start=(3736,), count=(5,)),
AdiosBlockinfo (process_id=17, time_index=1, start=(3741,), count=(0,)),
AdiosBlockinfo (process_id=18, time_index=1, start=(3741,), count=(2,)),
AdiosBlockinfo (process_id=19, time_index=1, start=(3743,), count=(1,)),
AdiosBlockinfo (process_id=20, time_index=1, start=(3744,), count=(1,)),
AdiosBlockinfo (process_id=21, time_index=1, start=(3745,), count=(0,)),
AdiosBlockinfo (process_id=22, time_index=1, start=(3745,), count=(2,)),
AdiosBlockinfo (process_id=23, time_index=1, start=(3747,), count=(2,)),
AdiosBlockinfo (process_id=24, time_index=1, start=(3749,), count=(1,)),
AdiosBlockinfo (process_id=25, time_index=1, start=(3750,), count=(1,)),
AdiosBlockinfo (process_id=26, time_index=1, start=(3751,), count=(2,)),
AdiosBlockinfo (process_id=27, time_index=1, start=(3753,), count=(0,)),
AdiosBlockinfo (process_id=28, time_index=1, start=(3753,), count=(0,)),
AdiosBlockinfo (process_id=29, time_index=1, start=(3753,), count=(2,)),
AdiosBlockinfo (process_id=30, time_index=1, start=(3755,), count=(0,)),
AdiosBlockinfo (process_id=31, time_index=1, start=(3755,), count=(1,)),
AdiosBlockinfo (process_id=32, time_index=1, start=(3756,), count=(1,)),
AdiosBlockinfo (process_id=33, time_index=1, start=(3757,), count=(0,)),
AdiosBlockinfo (process_id=34, time_index=1, start=(3757,), count=(2,)),
AdiosBlockinfo (process_id=35, time_index=1, start=(3759,), count=(1,)),
AdiosBlockinfo (process_id=36, time_index=1, start=(3760,), count=(2,)),
AdiosBlockinfo (process_id=37, time_index=1, start=(3762,), count=(6,)),
AdiosBlockinfo (process_id=38, time_index=1, start=(3768,), count=(3,)),
AdiosBlockinfo (process_id=39, time_index=1, start=(3771,), count=(3,)),
AdiosBlockinfo (process_id=40, time_index=1, start=(3774,), count=(3,)),
AdiosBlockinfo (process_id=41, time_index=1, start=(3777,), count=(0,)),
AdiosBlockinfo (process_id=42, time_index=1, start=(3777,), count=(7,)),
AdiosBlockinfo (process_id=43, time_index=1, start=(3784,), count=(3,)),
AdiosBlockinfo (process_id=44, time_index=1, start=(3787,), count=(0,)),
AdiosBlockinfo (process_id=45, time_index=1, start=(3787,), count=(10,)),
AdiosBlockinfo (process_id=46, time_index=1, start=(3797,), count=(15,)),
AdiosBlockinfo (process_id=47, time_index=1, start=(3812,), count=(0,)),
AdiosBlockinfo (process_id=48, time_index=1, start=(3812,), count=(7,)),
AdiosBlockinfo (process_id=49, time_index=1, start=(3819,), count=(21,)),
AdiosBlockinfo (process_id=50, time_index=1, start=(3840,), count=(171,)),
AdiosBlockinfo (process_id=51, time_index=1, start=(4011,), count=(7,)),
AdiosBlockinfo (process_id=52, time_index=1, start=(4018,), count=(50,)),
AdiosBlockinfo (process_id=53, time_index=1, start=(4068,), count=(593,)),
AdiosBlockinfo (process_id=54, time_index=1, start=(4661,), count=(34,)),
AdiosBlockinfo (process_id=55, time_index=1, start=(4695,), count=(48,)),
AdiosBlockinfo (process_id=56, time_index=1, start=(4743,), count=(1264,)),
AdiosBlockinfo (process_id=57, time_index=1, start=(6007,), count=(16,)),
AdiosBlockinfo (process_id=58, time_index=1, start=(6023,), count=(28,)),
AdiosBlockinfo (process_id=59, time_index=1, start=(6051,), count=(1537,))]]
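To relate the listing above to the three reads from the session, here is a minimal sketch in plain Python. It does not call the adios module; the (process_id, start, count) tuples are copied by hand from the blockinfo output, and both helper names are hypothetical. The criterion "a zero-count block whose start lies strictly inside the slice" is an assumption that happens to match the observed behaviour, not a statement about how ADIOS selects blocks internally.

```python
# (process_id, start, count) copied from the blockinfo listing, process_id 16..19.
blocks = [
    (16, 3736, 5),
    (17, 3741, 0),  # the zero-length block at the offset of concern
    (18, 3741, 2),
    (19, 3743, 1),
]


def data_blocks_overlapping(blocks, lo, hi):
    """Process ids of non-empty blocks that overlap the half-open slice [lo, hi)."""
    return [pid for pid, start, count in blocks
            if count > 0 and start < hi and start + count > lo]


def zero_blocks_inside(blocks, lo, hi):
    """Process ids of zero-count blocks whose start lies strictly between lo and hi.

    Assumption: this is only a criterion consistent with the reads shown in
    the session, not the actual block selection logic of the library.
    """
    return [pid for pid, start, count in blocks
            if count == 0 and lo < start < hi]


for lo, hi in [(3740, 3741), (3741, 3742), (3740, 3742)]:
    print((lo, hi),
          "data blocks:", data_blocks_overlapping(blocks, lo, hi),
          "zero blocks inside:", zero_blocks_inside(blocks, lo, hi))
```

Under that assumption, only the slice 3740:3742 spans two data blocks with the empty block of process 17 between them, which is the read that crashed.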
First question: I cannot even write a zero-length block with zlib transformation into the output because the write segfaults. How do you produce the file? Do you turn off zlib for the zero blocks?
Hi @pnorbert,
@psychocoderHPC just found the root of the issue and will provide a fix in a few minutes. Affects about half of the transforms: blosc, zlib, bzip2, lz4.
Writing zero-length blocks with transformations has been possible for a long time (I think we fixed that together in 1.10 or so) and is an important use case for unstructured, domain-decomposed data. We were writing with blosc (not zlib), where we skip compression on zero-size input in the write transform. Maybe the zlib transform has a bug if it does not do the same, but I seem to remember it worked for us in the past.
Or maybe it's just a misunderstanding of what we do: our overall variable is not zero-sized, it's just individual process groups that contribute zero in parallel writes.
It's just a missing meta-data check on read that is causing the crash right now: see #162
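For readers unfamiliar with the pattern discussed in this thread (skip the compressor for zero-size input on write, and check the stored size on read before decompressing), a rough Python sketch follows. It is not the ADIOS C implementation and not the change in #162: the function names are hypothetical, and zlib from the standard library stands in for blosc purely for illustration.

```python
import zlib


def compress_block(raw):
    """Write side: return (uncompressed_length, payload).

    A zero-size input skips the compressor and stores an empty payload.
    """
    if len(raw) == 0:
        return 0, b""
    return len(raw), zlib.compress(raw, 1)


def decompress_block(raw_len, payload):
    """Read side: consult the stored length before touching the payload.

    Without this check an empty payload would be handed to the decompressor;
    in this stand-in, zlib.decompress(b"") raises zlib.error.
    """
    if raw_len == 0:
        return b""
    out = zlib.decompress(payload)
    if len(out) != raw_len:
        raise ValueError("decompressed size does not match metadata")
    return out


def read_blocks(stored):
    """Concatenate the decoded blocks of one variable, empty ones included."""
    return b"".join(decompress_block(n, p) for n, p in stored)


# Three process groups contribute to one variable; the middle one wrote nothing.
stored = [compress_block(chunk) for chunk in (b"abcde", b"", b"fg")]
print(read_blocks(stored))
```

The point of the sketch is only that the overall variable is non-empty while individual blocks may be, so the read path has to tolerate a zero-length block between two populated ones.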