daniel-koehn/DENISE-Black-Edition

Exception ("merge: can't read model file !") in mergemod.c

pplotn opened this issue · 8 comments

Sometimes, while running DENISE PSV, I get the following error ("merge: can't read model file !") in mergemod.c.
What could be causing this?
I am using 12 nodes with 32 CPU cores each, with NPROCX=4 and NPROCY=4.

**Message from mergemod (printed by PE 0):
PE 0 starts merge of 16 model files
writing merged model file to ./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_vs_stage_1_it_10.bin
Opening model files: ./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_vs_stage_1_it_10.bin.??? ... finished.
Copying... ... finished.
Use
ximage n1=384 < ./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_vs_stage_1_it_10.bin label1=Y label2=X title=./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_vs_stage_1_it_10.bin
to visualize model.

PE 0 is writing model to
./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_rho_stage_1_it_10.bin.0.0

**Message from mergemod (printed by PE 0):
PE 0 starts merge of 16 model files

writing merged model file to ./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_rho_stage_1_it_10.bin
Opening model files: ./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_rho_stage_1_it_10.bin.??? Message from PE 0
R U N - T I M E E R R O R:
merge: can't read model file !
...now exiting to system.

-rw-r--r-- 1 plotnips k1404 0 May 19 22:17 modelTest_rho_stage_1_it_10.bin
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.0.0
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.0.1
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.0.2
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.0.3
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.1.0
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.1.1
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.1.2
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.1.3
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.2.0
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.2.1
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.2.2
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.2.3
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.3.0
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.3.1
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.3.2
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.3.3

Hi Pavel,

Assuming that you used 16 CPU cores for the domain decomposition, the remaining cores are used for shot parallelization. How many shots are you modelling in total? Is that number divisible by 24 without a remainder? Does the problem also occur when using fewer cores for the shot parallelization or, in the extreme case, when using only the domain decomposition?

Best regards,

Daniel

Hello Daniel,
I am modeling 51 shots.
As I understand it, I use 4*4=16 cores per shot.
Overall, I have 12*32=384 cores.
That means I parallelize over 384/16=24 shots at a time.
So I need 3 passes to go through all 51 shots.
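The arithmetic above can be written down as a small C sketch. The function names are hypothetical and are not part of DENISE; they only reproduce the core and shot counts quoted in this thread (384 cores, NPROCX=4, NPROCY=4, 51 shots).

```c
#include <assert.h>

/* Number of shots that can run at once: total cores divided by the
   cores each shot needs for its domain decomposition (NPROCX * NPROCY). */
int shot_groups(int total_cores, int nprocx, int nprocy)
{
    assert(nprocx > 0 && nprocy > 0);
    return total_cores / (nprocx * nprocy);
}

/* Passes needed to cover all shots (ceiling division). */
int shot_passes(int nshots, int groups)
{
    assert(groups > 0);
    return (nshots + groups - 1) / groups;
}

/* Shots handled in the last pass; a value smaller than `groups`
   means some shot groups have nothing to do in that pass. */
int shots_in_last_pass(int nshots, int groups)
{
    assert(groups > 0);
    int rem = nshots % groups;
    return rem == 0 ? groups : rem;
}
```

With the numbers from this thread, `shot_groups(384, 4, 4)` gives 24, `shot_passes(51, 24)` gives 3, and `shots_in_last_pass(51, 24)` gives 3, so 51 is not divisible by 24 and most groups are idle in the final pass.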

This exception is very rare; I don't get it for other model sizes or numbers of shots.

20320209ws_fwi_3_strategy_51_Overthrust_true.err.txt
20320209ws_fwi_3_strategy_51_Overthrust_true.out.txt

Hi Pavel,

I suspect that one problem with shot parallelization might be that the non-merged model files are removed in
PSV/model_it_out_PSV.c:

https://github.com/daniel-koehn/DENISE-Black-Edition/blob/master/src/PSV/model_it_out_PSV.c

Try commenting out or deleting all remove() calls in model_it_out_PSV.c and recompile the source code before running it again. If this is indeed the issue, similar problems will occur in gauss_filt.c and gauss_filt_var.c.

Best regards,

Daniel

OK, thanks Daniel. I recompiled the code, and the problem still occurs with the same velocity model, although it does not happen with other models.

PE 0 is writing model to
./fwi/ws_fwi_3_strategy_55/Overthrust_true/fld/model/modelTest_rho_stage_1_it_10.bin.0.0
**Message from mergemod (printed by PE 0):
PE 0 starts merge of 16 model files

writing merged model file to ./fwi/ws_fwi_3_strategy_55/Overthrust_true/fld/model/modelTest_rho_stage_1_it_10.bin
Opening model files: ./fwi/ws_fwi_3_strategy_55/Overthrust_true/fld/model/modelTest_rho_stage_1_it_10.bin.??? Message from PE 0
R U N - T I M E E R R O R:
merge: can't read model file !
...now exiting to system.

Hello, in my experience, setting NPROCX and NPROCY helps to get rid of this error.
It works with parallelization by shots enabled.

Increasing the STRINGSIZE value in fd.h helped.

That makes sense. If the model file name plus its directory path is longer than the predefined maximum STRINGSIZE in fd.h, the domain-decomposition numbering might be missing from the file name extension of the model files. The mergemod function will then fail to merge the model files from the different sub-domains correctly. Thank you for finding this bug, Pavel.
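To illustrate the failure mode described above, here is a minimal C sketch. It is not DENISE's code: `build_subdomain_name` and `ends_with` are hypothetical helpers, and the sketch assumes the sub-domain file name is formed as `<base>.<px>.<py>` in a fixed-size buffer, as the file listing in this thread suggests. It uses `snprintf`, which truncates safely and reports the required length; how DENISE itself builds these names is not shown in this thread, and an unchecked `sprintf` into a too-small buffer would be undefined behavior rather than clean truncation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "<base>.<px>.<py>" in dst, which holds cap bytes.
   Returns 1 if the whole name fit, 0 if it was truncated. */
int build_subdomain_name(char *dst, size_t cap, const char *base, int px, int py)
{
    assert(dst != NULL && base != NULL && cap > 0);
    /* snprintf returns the length the full string would have had. */
    int n = snprintf(dst, cap, "%s.%d.%d", base, px, py);
    return n >= 0 && (size_t)n < cap;
}

/* Return 1 if name ends with suffix, else 0. */
int ends_with(const char *name, const char *suffix)
{
    size_t n = strlen(name), s = strlen(suffix);
    return n >= s && strcmp(name + n - s, suffix) == 0;
}
```

Calling `build_subdomain_name` with the rho path from the log and a small buffer (for example 80 bytes, an arbitrary size chosen for illustration) returns 0, and the resulting name no longer ends in `.0.0`, so a later open of the expected sub-domain file would fail. With a 150-byte buffer the same call returns 1 and the suffix is intact. Checking the return value at the point where names are built would turn a confusing merge error into an explicit "path too long" message.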

Yes, Daniel.
My folder paths are rather long, so I increased STRINGSIZE to 150.