dellacortelab/prospr

Can't get it to run: multiple issues (docker, plmDCA and mat file)

Closed this issue · 8 comments

First, thank you for your amazing work. When I tried to predict a structure for my own sequence, I ran into multiple problems.
I started with docker, running this command:
$ docker run -t -v /path/to/my/data:/1IWA prospr/prospr build 1IWA
It asked whether I would like to download uniclust for hhblits, ignoring the fully decompressed uniclust database I already have under data/hhblits. After I answered "y", it got stuck without doing anything: the CPU load was 0 and no download was in progress.

Then I tried running it from the Python code with this command:
python prospr.py build 1IWA
and it still says the hhblits file is missing, although I have it:
FileNotFoundError: [Errno 2] No such file or directory: 'hhblits': 'hhblits'
If I use the previously built pssm file to generate the pkl file, this is what I get:

[2021-01-20 17:27:32.312146] Building pssm file
pssm file exists, would you like to use it? [y,n] y
[2021-01-20 17:27:34.194319] . . .pssm file completed
[2021-01-20 17:27:34.194375] building final pkl sequence.
Traceback (most recent call last):
File "/home/zyxue/anaconda3/envs/prosprenv/lib/python3.6/site-packages/scipy/io/matlab/mio.py", line 39, in _open_file
return open(file_like, mode), True
FileNotFoundError: [Errno 2] No such file or directory: '/home/zyxue/Record/prosprdata/1IWA/1IWA.mat'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "prospr.py", line 126, in <module>
build(args)
File "prospr.py", line 112, in build
s.build(args)
File "/home/zyxue/Record/prospr-master/prospr/io.py", line 39, in build
s.potts()
File "/home/zyxue/Record/prospr-master/prospr/io.py", line 93, in potts
potts = loadmat(filename)
File "/home/zyxue/anaconda3/envs/prosprenv/lib/python3.6/site-packages/scipy/io/matlab/mio.py", line 222, in loadmat
with _open_file_context(file_name, appendmat) as f:
File "/home/zyxue/anaconda3/envs/prosprenv/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/home/zyxue/anaconda3/envs/prosprenv/lib/python3.6/site-packages/scipy/io/matlab/mio.py", line 17, in _open_file_context
f, opened = _open_file(file_like, appendmat, mode)
File "/home/zyxue/anaconda3/envs/prosprenv/lib/python3.6/site-packages/scipy/io/matlab/mio.py", line 45, in _open_file
return open(file_like, mode), True
FileNotFoundError: [Errno 2] No such file or directory: '/home/zyxue/Record/prosprdata/1IWA/1IWA.mat'

I'm really confused about how to generate this mat file and how to fix this. If anyone can offer advice, any kind of help would be appreciated!
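
A note on the `FileNotFoundError: ... 'hhblits'` message above: when Python fails to launch an external program, it raises this error with the program name as the missing "file", so the message most likely means the `hhblits` executable is not on `PATH`, not that the uniclust database is missing. A minimal sketch for checking this, assuming prospr.py calls `hhblits` by its bare name (the helper name `find_executable` is hypothetical):

```python
import shutil


def find_executable(name):
    """Return the full path of `name` if it is found on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    # "hhblits" is the program name taken from the error message above.
    path = find_executable("hhblits")
    if path is None:
        print("hhblits is not on PATH; add its bin directory to PATH first")
    else:
        print("hhblits found at", path)
```

If the check prints that hhblits is not on `PATH`, adding the hh-suite `bin` directory to `PATH` before running `python prospr.py build 1IWA` would be the first thing to try.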

Did you try my branch & code?
https://github.com/yamule/prospr#prospr-using-pre-computed-input-files
If you don't have cuda, change "cuda:0" to "cpu".

  • Firstly, try run.py using
    python run.py run -n nn/ProSPr_full.nn -p example_files/2E74/2E74_D.pdb_d0.fas.jackali.max.ascii -b example_files/2E74/2E74_D.pdb_d0.fas.jackali.max.hhm -m example_files/2E74/2E74_D.pdb_d0.fas.jackali.max.dcares.dat.mat -g "cuda:0" -o testout.dat -f example_files/2E74/2E74_D.pdb_d0.fas.jackali.max.tmp.fas
  • Secondly, try using files in example_files/1L9Y/
    1L9Y_A.pdb_d0.msa will be used to generate .mat file with code in
    https://github.com/yamule/prospr/tree/sep/potts_PlmDCA.jl
    (please follow howto.txt) and
    python run.py run -n nn/ProSPr_full.nn -p example_files/1L9Y/1L9Y_A.pdb_d0.pssm.ascii -b example_files/1L9Y/1L9Y_A.pdb_d0.hhm -m <matfile you generated> -g "cuda:0" -o testout2.dat -f example_files/1L9Y/1L9Y_A.pdb_d0.fas
  • Thirdly, try with any sequence which you want to predict.

Thanks for your reply, Yamule! I checked your branch and howto.txt, which mentions using the .msa file as input to get the .mat file. May I ask how you generated the .msa file from the sequence? It should be a multiple sequence alignment file, right?
Also, I noticed that your run command uses a .pssm.ascii file as the psiblast result. May I ask why there is an ascii suffix?
Thanks again for your kind help!

may I wonder how you generated the .msa file from the sequence, it should be a multi-sequence alignment file, right?

Yes, it is an aligned FASTA file. I forget exactly how it was generated, but I think I made it with hhblits or jackhmmer: it must originally have been created in a3m format, with the lowercase letters then removed. (If this is unfamiliar, see the a3m format description in the hh-suite user guide: https://github.com/soedinglab/hh-suite/releases/tag/userguide .)
The .msa for 1L9Y_A.pdb_d0 is provided at
https://github.com/yamule/prospr/blob/sep/example_files/1L9Y/1L9Y_A.pdb_d0.msa
so you don't need to build it yourself. Please check whether you can run PlmDCA.jl successfully in step 2.
Please let me know which step you are stuck on, or whether you could get through all the steps.
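
For a new sequence, the lowercase-removal step described above could be sketched as follows. This is only an illustration, not the script that produced the example .msa: in a3m, lowercase letters mark insertions relative to the query, so dropping them leaves every sequence with the same length as the query. The function name `a3m_to_aligned_fasta` is hypothetical, and whether PlmDCA.jl needs any further cleanup of the headers is not confirmed here.

```python
def a3m_to_aligned_fasta(a3m_text):
    """Turn a3m text into (header, aligned_sequence) pairs.

    Lowercase letters (insertions relative to the query) are dropped;
    uppercase letters and '-' gap characters are kept.
    """
    records = []
    header, chunks = None, []
    for line in a3m_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif header is not None:
            # Lines before the first '>' header are ignored.
            chunks.append("".join(c for c in line if not c.islower()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">query\nACDE\n>hit1\nA-DaE\n>hit2\nAcCD-\n"
    for name, seq in a3m_to_aligned_fasta(example):
        print(name)
        print(seq)
```

After conversion, all sequences should have the same length; checking that is a quick sanity test before passing the file on.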

may I wonder why there's an ascii suffix?

It's just my preference.
psi-blast can write the pssm in two different formats: ASN.1 and ASCII text.
I therefore usually append "ascii" to files in the latter format. Please check the -out_ascii_pssm and -out_pssm options of psi-blast.
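
As an illustration of the two options, here is a small sketch that assembles a psiblast command line writing both files. The query file, database name, iteration count and output base name are placeholders, and this is not necessarily the command that prospr itself runs; the function name `psiblast_pssm_command` is hypothetical.

```python
import shlex


def psiblast_pssm_command(query, db, basename, iterations=3):
    """Build a psiblast argument list that writes both pssm formats."""
    return [
        "psiblast",
        "-query", query,
        "-db", db,
        "-num_iterations", str(iterations),
        # ASCII text pssm, hence the ".ascii" suffix convention
        "-out_ascii_pssm", basename + ".pssm.ascii",
        # ASN.1 checkpoint file
        "-out_pssm", basename + ".pssm",
    ]


if __name__ == "__main__":
    # All file and database names below are placeholders.
    cmd = psiblast_pssm_command("my_query.fas", "my_blast_db", "my_query")
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run` once psiblast and a formatted database are available.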

Hi Yamule,
Thank you so much for your help. I managed to use your Julia plmDCA to get the .mat file and make predictions! The output of the Julia plmDCA does not seem to be recognized by the original prospr code, though, so I borrowed your code again for the predictions. Thanks a lot for sharing! @yamule

Nice.

Although the output of julia plmDCA seems not recognizable by the original prospr code so I borrowed your code again and got the predictions!

Yes, because their shapes are different.
I forget the details, but in the Matlab version, the 0th and the 21st elements are used for gaps or Xs or something.
In the Julia version, only the 21st element is reserved for irregular cases.
(The CCMPred matrix uses a different amino acid index, so it would be quite difficult to use CCMPred, I guess.)

Thanks!
BTW, I noticed that the output includes a prediction file for the phi and psi angles of the residues. However, it seems to contain the score vector output directly by the network rather than the predicted values. I didn't see any description of it in the authors' original documentation. Do you know how to convert it to real angles?

I didn't see there's any description on that in the authors' original document, do you know how to convert that to real angles?

Oh, yes. My script outputs it, but the original code does not, so I don't recommend using it.
The other text files are also rearranged for ease of use; they are not the official, original result files.
To get the original output, please use the --raw option.
In any case, the 36 output bins should correspond to angles from -180 to +180 degrees, so the first bin is -180 to -170 degrees, the second bin is -170 to -160 degrees, and so forth.
The predicted angle for a residue is then
numpy.argmax(the_result_array_of_bins_for_a_residue)*10 - 180 +5
(I added +5 to get the center of the bin.)
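
The same conversion can be written as a small helper. This is a sketch under the assumptions stated above (36 bins of 10 degrees starting at -180); it uses plain Python so it runs without numpy, but it picks the same index as `numpy.argmax`, including returning the first index on ties. The function name `bins_to_angle` is hypothetical.

```python
def bins_to_angle(scores, bin_width=10.0, start=-180.0):
    """Return the center (in degrees) of the highest-scoring bin.

    `scores` is the per-residue vector of bin scores, e.g. 36 values
    covering -180 to +180 degrees in 10-degree steps.
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best * bin_width + start + bin_width / 2


if __name__ == "__main__":
    scores = [0.0] * 36
    scores[0] = 1.0  # highest score in the first bin (-180 to -170)
    print(bins_to_angle(scores))  # -175.0
```

Taking the center of the best bin limits the resolution to 10 degrees, which is worth keeping in mind when comparing against angles measured from a structure.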

I see. I will check how it performs against the real values. Thanks for your reply!