kiharalab/DiffModeler

read score issue

Closed this issue · 13 comments

I am getting this error:

Traceback (most recent call last):
  File "main.py", line 126, in <module>
    fit_structure_chain(diff_trace_map,fitting_dict,fitting_dir,params)
  File "/home/jadolfbr/DiffModeler/modeling/fit_structure_chain.py", line 75, in fit_structure_chain
    read_score(new_score_dict,pdb_dir,output_path)
  File "/home/jadolfbr/DiffModeler/modeling/score_utils.py", line 40, in read_score
    listoldpdb = [x for x in os.listdir(pdb_dir) if ".pdb" in x]
FileNotFoundError: [Errno 2] No such file or directory: '/home/jadolfbr/DiffModeler/Predict_Result/6824/structure_modeling/A/fit_experiment_0/PDB'

The last things printed before the crash are as follows:

origin          : (79., 42., 38.)
map             : b'MAP '
machst          : [68 68  0  0]
rms             : 0.4092388451099396
nlabl           : 1
label           : [b'Created by mrcfile.py                                       2024-02-08 17:59:07 '
 b'' b'' b'' b'' b'' b'' b'' b'' b'']
/home/jadolfbr/DiffModeler/Predict_Result/6824/structure_assembling/iterative_B existed
WARNING: Use StructureBlurrer.gaussian_blur_real_space_box()to blured a map with a user defined defined cubic box

Thank you for your interest in DiffModeler! I think the VESPER fitting step failed; my guess is that VESPER is not configured correctly.
Could you please provide your output results (everything under Predict_Result)? You can send the zipped results to my email wang3702@uw.edu. Alternatively, you can paste the VESPER output log here: /home/jadolfbr/DiffModeler/Predict_Result/6824/structure_modeling/A/vesper_simu_output_*.out. If you do not find such output, that confirms the VESPER failure. We would also need your command line and output files to debug.
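For reference, a quick way to check for those logs (a minimal sketch; the result directory and chain name below just mirror the paths in your traceback, so adjust them for your own run):

```python
import glob
import os

# Minimal check: did VESPER write any output logs for this chain?
# The directory and chain name mirror the paths in the traceback above;
# adjust them for your own run.
result_dir = "/home/jadolfbr/DiffModeler/Predict_Result/6824/structure_modeling"
chain = "A"

logs = glob.glob(os.path.join(result_dir, chain, "vesper_simu_output_*"))
if logs:
    print("VESPER produced output logs; please share these:", sorted(logs))
else:
    print("No VESPER output logs found -- the VESPER fitting step likely failed.")
```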

Also, feel free to use our server https://em.kiharalab.org/algorithm/DiffModeler. We have never seen such errors on our server.

Thanks for the quick reply! I cloned VESPER, but when I ran DiffModeler, it seemed to get it working automatically? I didn't see any instructions for VESPER on the DiffModeler page - did I miss them somewhere? Maybe I can try to get that properly set up, re-run, and if there are still problems, I'll send the results?

Alright, so the output is 1.9GB. We may need to grab specific components of the outputs. I have the logs (and the vesper_simu output) and pretty much everything up to where it crashed.

What would be most helpful to send?

Alright, so I see all the setup in VESPER_CUDA. Part of that setup puts it in a separate env (conda activate vesper_cuda). Are you activating that env in the DiffModeler script, or are you using the same DiffModeler env to call both?

It will be configured automatically; you do not need to configure it separately.
Then please share this file with us: /home/jadolfbr/DiffModeler/Predict_Result/6824/structure_modeling/A/vesper_simu_output_*.out.

Also, could you please list all the files generated under /home/jadolfbr/DiffModeler/Predict_Result/6824/structure_modeling/A/? And could you provide the command line you used to run DiffModeler?

Here is the full output of the error after running multiple times and confirming that the GPU is the blocker. This error hits every time (4 separate runs), and only one person (me) is using this GPU, as confirmed by nvidia-smi.

cmd:
python3 main.py --mode=0 -F=example/6824.mrc -P=example -M=example/input_info.txt --config=config/diffmodeler.json --contour=2 --gpu=0 --resolution=5.8

Full error:

sampling loop time step: 100%|██████████| 100/100 [00:19<00:00,  5.04it/s]
Traceback (most recent call last):
  File "/home/jadolfbr/DiffModeler/VESPER_CUDA/main.py", line 183, in <module>
    fitter = MapFitter(
  File "/home/jadolfbr/DiffModeler/VESPER_CUDA/fitter.py", line 147, in __init__
    self.ldp_atoms = torch.from_numpy(np.array(ldp_atoms)).to(self.device)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Traceback (most recent call last):
  File "main.py", line 126, in <module>
    fit_structure_chain(diff_trace_map,fitting_dict,fitting_dir,params)
  File "/home/jadolfbr/DiffModeler/modeling/fit_structure_chain.py", line 75, in fit_structure_chain
    read_score(new_score_dict,pdb_dir,output_path)
  File "/home/jadolfbr/DiffModeler/modeling/score_utils.py", line 40, in read_score
    listoldpdb = [x for x in os.listdir(pdb_dir) if ".pdb" in x]
FileNotFoundError: [Errno 2] No such file or directory: '/home/jadolfbr/DiffModeler/Predict_Result/6824/structure_modeling/A/fit_experiment_0/PDB'

Attached VESPER logs:
vesper_simu_output_0.txt
[vesper_simu_output_1.txt](https://github.com/kiharalab/DiffModeler/files/14223400/vesper_simu_output_1.txt)

Contents of the structure_modeling directory:

-rw-r--r--  1 jadolfbr  staff    19K Feb  8 18:22 vesper_simu_output_1.txt
-rw-r--r--  1 jadolfbr  staff   365K Feb  8 18:22 top1.pdb
-rw-r--r--  1 jadolfbr  staff   5.5M Feb  8 18:22 iterative_0_tmp.mrc
-rw-r--r--  1 jadolfbr  staff   5.5M Feb  8 18:22 iterative_0.mrc
drwxr-xr-x  3 jadolfbr  staff    96B Feb  8 18:22 fit_experiment_0
-rw-r--r--  1 jadolfbr  staff    38K Feb  8 18:22 vesper_log
-rw-r--r--  1 jadolfbr  staff    21K Feb  8 18:22 score.pkl
-rw-r--r--  1 jadolfbr  staff   5.5M Feb  8 18:22 iterative_1_tmp.mrc
-rw-r--r--  1 jadolfbr  staff   5.5M Feb  8 18:22 iterative_1.mrc
drwxr-xr-x  3 jadolfbr  staff    96B Feb  8 18:22 fit_experiment_1
-rw-r--r--  1 jadolfbr  staff    19K Feb  8 18:23 vesper_simu_output_0.txt

It is possible that your GPU's compute mode is set to Exclusive Process mode. You can check this on the right side of the nvidia-smi output panel; if it shows "E. Process" for that GPU, VESPER_CUDA will not work, because the CUDA context cannot be shared across threads.
A temporary fix would be either changing the compute mode to Default, or changing "thread": 6 to "thread": 1 in the "vesper" section of the config/diffmodeler.json file for DiffModeler. I have no way to validate the latter, though.

The suggestion above is from our VESPER CUDA developer. Please see if it works for you; if it does, we will update the instructions.
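For reference, a minimal sketch of the config-side workaround, assuming config/diffmodeler.json is plain JSON and its "vesper" section exposes a "thread" key as described above (editing the file by hand works just as well):

```python
import json

# Drop the VESPER thread count to 1 so only a single thread holds the CUDA
# context (a workaround for GPUs in Exclusive Process compute mode).
# Assumes the "vesper"/"thread" layout described above; adjust if your
# config file differs.
config_path = "config/diffmodeler.json"

with open(config_path) as f:
    config = json.load(f)

config["vesper"]["thread"] = 1

with open(config_path, "w") as f:
    json.dump(config, f, indent=4)
```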

Thanks, both. Yes, it is exclusive, and because this is an AWS VPCx, there is no way to change this (and changing it would not be good anyway, due to spot instances, etc.).

I will try the latter and get the code up and running on EC2/SageMaker, where I have more control over exclusive modes. Usually you want exclusive mode, so at first I thought this was someone else trying to use the GPU! I will try this next week. Thanks for your timely input; it is very much appreciated!

Let us know if you still encounter problems. Glad to help.

This was indeed the problem. Running on a SageMaker instance that allows concurrency on the GPU fixed this - though I do wish I could run it on other systems as well. Thanks for the help!