Incorrect Timing of File Close when Using NetCDF4
yzanhua opened this issue · 0 comments
Summary
NetCDF4 can perform parallel I/O through parallel HDF5. When using Darshan to capture a NetCDF4 application's I/O behavior, I observe that the actual file close is deferred from the nc_close call to MPI_Finalize. This incorrect timing affects the correctness of the Log VOL, which needs to use and release HDF5 resources at file close time; some of these resources (e.g. H5T_STD_B8LE) are not available at MPI_Finalize.
Reproduce
Test program
test.c is a simple NetCDF4 program that creates a NetCDF4 file and closes it immediately. It also prints the strings application: nc_close start and application: nc_close end before and after nc_close.
test.c:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>

#define FATAL_ERR {if(err!=NC_NOERR) {printf("Error at line=%d: %s Aborting ...\n", __LINE__, nc_strerror(err)); goto fn_exit;}}
#define ERR {if(err!=NC_NOERR)printf("Error at line=%d: %s\n", __LINE__, nc_strerror(err));}

int main(int argc, char** argv)
{
    const char* filename="testfile";
    int err;
    int ncid, cmode;

    MPI_Init(&argc, &argv);

    /* create a new file for writing ----------------------------------------*/
    cmode = NC_NETCDF4 | NC_CLOBBER | NC_MPIIO;
    err = nc_create_par(filename, cmode, MPI_COMM_WORLD, MPI_INFO_NULL, &ncid); FATAL_ERR

    /* exit define mode */
    err = nc_enddef(ncid); ERR

    /* close the file */
    printf("========= application: nc_close start\n");
    err = nc_close(ncid); ERR
    printf("========= application: nc_close end\n");

fn_exit:
    MPI_Finalize();
    return 0;
}
Compile and Run
A Makefile is provided below. Run make to compile the program; make withdarshan and make nodarshan run it with and without Darshan, respectively. Note that the Passthrough VOL is enabled so that a message is printed when the actual file close happens. The Passthrough VOL ships with the HDF5 installation, but HDF5 must be built with CFLAGS="-DENABLE_PASSTHRU_LOGGING" to enable the printing. The program runs with 1 MPI process.
Makefile:
DARSHAN_DIR=${LOCAL_HOME}/Darshan/3.4.2/lib/libdarshan.so
HDF5_DIR=${LOCAL_HOME}/HDF5/1.14.0
NETCDF_DIR=${LOCAL_HOME}/NetCDF/install

# Note: ${HDF5} is not defined in this Makefile, so HDF5_PLUGIN_PATH
# expands to /lib, as seen in the recorded outputs.

all:
	mpicc test.c -g -o test \
	    -I${NETCDF_DIR}/include \
	    -L${NETCDF_DIR}/lib -lnetcdf

withdarshan:
	HDF5_PLUGIN_PATH=${HDF5}/lib \
	LD_LIBRARY_PATH=${NETCDF_DIR}/lib:${HDF5_DIR}/lib \
	HDF5_VOL_CONNECTOR="pass_through under_vol=0;under_info={}" \
	mpirun -n 1 -env LD_PRELOAD="${DARSHAN_DIR}" ./test

nodarshan:
	HDF5_PLUGIN_PATH=${HDF5}/lib \
	LD_LIBRARY_PATH=${NETCDF_DIR}/lib:${HDF5_DIR}/lib \
	HDF5_VOL_CONNECTOR="pass_through under_vol=0;under_info={}" \
	mpirun -n 1 ./test

clean:
	rm -rf testfile core.* test
Outputs
The outputs with and without Darshan are shown below. They are expected to be the same, but they differ in when the file is closed. Without Darshan, PASS THROUGH VOL FILE Close occurs between application: nc_close start and application: nc_close end. With Darshan enabled, PASS THROUGH VOL FILE Close occurs after application: nc_close end.
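The ordering check described above can be automated over a captured log. The following is only a minimal sketch: the helper name file_close_inside_nc_close is hypothetical, and it assumes the whole program output has been captured into one string containing the exact marker texts printed by test.c and by the Passthrough VOL.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of NetCDF, HDF5 or Darshan).
 * Scans a captured log and reports where the first
 * "PASS THROUGH VOL FILE Close" message falls relative to the
 * nc_close markers printed by test.c.
 * Returns  1 if it lies between "nc_close start" and "nc_close end",
 *          0 if it lies outside that window,
 *         -1 if any of the three messages is missing. */
static int file_close_inside_nc_close(const char *log)
{
    const char *start = strstr(log, "application: nc_close start");
    const char *end   = strstr(log, "application: nc_close end");
    const char *fclose_msg = strstr(log, "PASS THROUGH VOL FILE Close");

    if (start == NULL || end == NULL || fclose_msg == NULL)
        return -1;

    /* All pointers refer to the same buffer, so comparing them
     * compares positions in the log. */
    return (start < fclose_msg && fclose_msg < end) ? 1 : 0;
}
```

Applied to the two outputs in this report, the helper would be expected to return 1 for the run without Darshan and 0 for the run with Darshan.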
Output without Darshan (expected):
HDF5_PLUGIN_PATH=/lib \
LD_LIBRARY_PATH=/files2/scratch/zhd1108/NetCDF/install/lib:/files2/scratch/zhd1108/HDF5/1.14.0/lib \
HDF5_VOL_CONNECTOR="pass_through under_vol=0;under_info={}" \
mpirun -n 1 ./test
------- PASS THROUGH VOL INIT
------- PASS THROUGH VOL INFO String To Info
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL FILE Create
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL INTROSPECT OptQuery
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL File Optional
------- PASS THROUGH VOL WRAP Object
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL GROUP Open
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL FILE Get
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL ATTRIBUTE Specific
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL ATTRIBUTE Create
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL ATTRIBUTE Write
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL ATTRIBUTE Close
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL FILE Specific
------- PASS THROUGH VOL WRAP CTX Free
========= application: nc_close start
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL ATTRIBUTE Specific
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL FILE Specific
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL H5Gclose
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL FILE Close
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL UNWRAP Object
------- PASS THROUGH VOL WRAP CTX Free
========= application: nc_close end
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL TERM
Output with Darshan enabled:
HDF5_PLUGIN_PATH=/lib \
LD_LIBRARY_PATH=/files2/scratch/zhd1108/NetCDF/install/lib:/files2/scratch/zhd1108/HDF5/1.14.0/lib \
HDF5_VOL_CONNECTOR="pass_through under_vol=0;under_info={}" \
mpirun -n 1 -env LD_PRELOAD="/files2/scratch/zhd1108/Darshan/3.4.2/lib/libdarshan.so" ./test
------- PASS THROUGH VOL INIT
------- PASS THROUGH VOL INFO String To Info
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL FILE Create
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL INFO Copy
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL INTROSPECT OptQuery
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL File Optional
------- PASS THROUGH VOL WRAP Object
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL GROUP Open
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL FILE Get
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL ATTRIBUTE Specific
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL ATTRIBUTE Create
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL ATTRIBUTE Write
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL ATTRIBUTE Close
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL FILE Specific
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL OBJECT Get
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL Get object
========= application: nc_close start
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL ATTRIBUTE Specific
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL FILE Specific
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL OBJECT Get
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL Get object
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL H5Gclose
------- PASS THROUGH VOL WRAP CTX Free
========= application: nc_close end
------- PASS THROUGH VOL WRAP CTX Get
------- PASS THROUGH VOL FILE Close
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL UNWRAP Object
------- PASS THROUGH VOL WRAP CTX Free
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL INFO Free
------- PASS THROUGH VOL TERM
Library Version
- HDF5 1.14.0, configured with --enable-parallel, --enable-build-mode=debug, and CFLAGS="-DENABLE_PASSTHRU_LOGGING"
- NetCDF 4.9.1, configured with --disable-dap --disable-mmap --disable-nczarr --disable-byterange (some of these configure options are necessary to avoid known compile issues with HDF5 1.14.0)
- Darshan 3.4.2
(I tested that HDF5 1.13.2 with NetCDF 4.9.0 also reproduces the problem.)
Other findings
The problem can be reproduced without the Passthrough VOL. We can add a print statement for info->count in the HDF5 source code here. It shows the reference count of an object. If Darshan is not enabled, the reference count is 1 at the time nc_close calls H5Fclose. If Darshan is enabled, the reference count is 3, so HDF5 concludes that someone else is still accessing the file and delays the actual close until the very end. I am not sure whether Darshan holds an extra reference to the file or whether it is an issue more related to NetCDF4. Using HDF5 directly (no NetCDF4 involved) does not show this issue.
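
To make the effect of the extra references concrete, here is a toy model of a reference-counted close. It is only an illustrative sketch: toy_file_t, toy_open, toy_incref and toy_close are hypothetical names and do not correspond to HDF5's, NetCDF4's or Darshan's actual data structures or code paths.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a shared file object (NOT HDF5's real structure). */
typedef struct {
    int  count;   /* number of outstanding references */
    bool closed;  /* set once the real close has happened */
} toy_file_t;

/* Open a file object; the caller holds the only reference. */
static toy_file_t toy_open(void)
{
    toy_file_t f = { 1, false };
    return f;
}

/* Another party (for example an instrumentation layer) takes a reference. */
static void toy_incref(toy_file_t *f)
{
    assert(!f->closed);
    f->count++;
}

/* Release one reference. The real close runs only when no references
 * remain. Returns true if this call performed the real close. */
static bool toy_close(toy_file_t *f)
{
    assert(f->count > 0);
    f->count--;
    if (f->count == 0) {
        f->closed = true;
        return true;
    }
    return false;
}
```

In this model, a file with a count of 1 is really closed by the first toy_close call, which mirrors the behavior without Darshan. With a count of 3, the first toy_close only decrements the counter, and the real close waits until the remaining two references are released, which mirrors the delayed close observed with Darshan.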