RasmussenLab/phamb

Error while running RF model

Closed this issue · 5 comments

Hi. I am getting an error when I run the RF model.

I used the following command to start the run (I am using the latest version of phamb):

python /Phamb_new/mag_annotation/scripts/run_RF.py /Vamb/contigs.flt.fna.gz /Vamb/vamb/clusters.tsv /lustre7/home/bhimbiswa/MAGs/Virus/Phamb_new/annotations /Phamb_new/Result_dir

I got the following error.

Traceback (most recent call last):
  File "/lustre7/home/bhimbiswa/MAGs/Virus/Phamb_new/mag_annotation/scripts/run_RF.py", line 227, in <module>
    viral_annotation = run_RF_modules.Viral_annotation(annotation_files=viral_annotation_files,genomes=reference)
  File "/lustre7/home/bhimbiswa/MAGs/Virus/Phamb_new/mag_annotation/scripts/run_RF_modules.py", line 358, in __init__
    self._parse_viralannotation_file(filetype.lower(),file)
  File "/lustre7/home/bhimbiswa/MAGs/Virus/Phamb_new/mag_annotation/scripts/run_RF_modules.py", line 386, in _parse_viralannotation_file
    annotation_tuple = parse_function(line)
  File "/lustre7/home/bhimbiswa/MAGs/Virus/Phamb_new/mag_annotation/scripts/run_RF_modules.py", line 513, in _parse_dvf_row
    score =round(float(score),2)
ValueError: could not convert string to float: 'score'

This is what my "all.DVF.predictions.txt" file looks like.

name    len     score   pvalue
S10CNODE_1_length_374305_cov_118.066653 374305  0.4933076500892639      0.06760329330009819
S10CNODE_2_length_331174_cov_150.761282 331174  0.5215792059898376      0.05410151824155903
S10CNODE_3_length_327615_cov_134.196242 327615  0.6207031011581421      0.03997658433416421
S10CNODE_4_length_275508_cov_107.113522 275508  0.3987869620323181      0.09687287559483344
S10CNODE_5_length_273839_cov_39.234849  273839  0.37943029403686523     0.10166931037087393
S10CNODE_6_length_265257_cov_21.606357  265257  0.7501952648162842      0.029231815091774305
S10CNODE_7_length_254430_cov_27.129502  254430  0.6598391532897949      0.036350932849913135
S10CNODE_8_length_239244_cov_15.625518  239244  0.5251834392547607      0.05332729058085958
S10CNODE_9_length_235224_cov_151.910707 235224  0.4213518500328064      0.09149104917289826

Can you please help me in solving this?

Bhim

enryH commented

Can you try deleting the first header line? As I read the error, the program fails when it tries to convert the header string "score" to a float value.

float("score") # fails
float(0.4933076500892639) # should work
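
If it is unclear which lines trigger the failure, a quick check along these lines might help. This is only a sketch: find_non_numeric_rows is a hypothetical helper, not part of phamb, and it assumes the whitespace-separated name/len/score/pvalue layout from the file excerpt above. The sample rows are shortened from that excerpt, with a second header added for illustration.

```python
def find_non_numeric_rows(lines):
    """Return (line_number, line) pairs whose third column is not a float."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        try:
            float(fields[2])  # the score column in the assumed layout
        except (IndexError, ValueError):
            bad.append((number, line.rstrip("\n")))
    return bad


# Illustrative input: a header that reappears in the middle of the file.
sample = [
    "name\tlen\tscore\tpvalue\n",
    "S10CNODE_1_length_374305_cov_118.066653\t374305\t0.4933076500892639\t0.06760329330009819\n",
    "name\tlen\tscore\tpvalue\n",
    "S10CNODE_2_length_331174_cov_150.761282\t331174\t0.5215792059898376\t0.05410151824155903\n",
]

for number, line in find_non_numeric_rows(sample):
    print(number, line)
```

On a real file, passing an open file handle instead of the sample list should report every line number where a header (or any other non-numeric score) still sits.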

Best, Henry

Dear Henry,

Thank you very much for your quick reply.

I tried your solution but still got the same error.

Traceback (most recent call last):
  File "/lustre7/home/bhimbiswa/MAGs/Virus/Phamb_new/mag_annotation/scripts/run_RF.py", line 227, in <module>
    viral_annotation = run_RF_modules.Viral_annotation(annotation_files=viral_annotation_files,genomes=reference)
  File "/lustre7/home/bhimbiswa/MAGs/Virus/Phamb_new/mag_annotation/scripts/run_RF_modules.py", line 358, in __init__
    self._parse_viralannotation_file(filetype.lower(),file)
  File "/lustre7/home/bhimbiswa/MAGs/Virus/Phamb_new/mag_annotation/scripts/run_RF_modules.py", line 386, in _parse_viralannotation_file
    annotation_tuple = parse_function(line)
  File "/lustre7/home/bhimbiswa/MAGs/Virus/Phamb_new/mag_annotation/scripts/run_RF_modules.py", line 513, in _parse_dvf_row
    score =round(float(score),2)
ValueError: could not convert string to float: 'score'.

As you suggested, I removed the header line of "all.DVF.predictions.txt".

S10CNODE_1_length_374305_cov_118.066653 374305  0.4933076500892639      0.06760329330009819
S10CNODE_2_length_331174_cov_150.761282 331174  0.5215792059898376      0.05410151824155903
S10CNODE_3_length_327615_cov_134.196242 327615  0.6207031011581421      0.03997658433416421
S10CNODE_4_length_275508_cov_107.113522 275508  0.3987869620323181      0.09687287559483344
S10CNODE_5_length_273839_cov_39.234849  273839  0.37943029403686523     0.10166931037087393
S10CNODE_6_length_265257_cov_21.606357  265257  0.7501952648162842      0.029231815091774305
S10CNODE_7_length_254430_cov_27.129502  254430  0.6598391532897949      0.036350932849913135
S10CNODE_8_length_239244_cov_15.625518  239244  0.5251834392547607      0.05332729058085958
S10CNODE_9_length_235224_cov_151.910707 235224  0.4213518500328064      0.09149104917289826

Regards,
Bhim

Hi Bhim

Did you remove all header lines in the all.DVF.predictions.txt file? If you concatenated a bunch of DeepVirFinder files, you will still have headers on multiple lines of the file.

Can you try something like this to make sure there are no headers left:
grep -v "pvalue" all.DVF.predictions.txt > all.DVF.predictions.NEW.txt
Then use the all.DVF.predictions.NEW.txt as input instead.
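
For anyone who prefers to do the same filtering in Python, a rough equivalent is sketched below. It is not phamb code: drop_header_lines and HEADER_FIELDS are hypothetical names, and the header fields are assumed to be exactly name, len, score and pvalue as in the file excerpt above. Unlike the grep command, it removes only lines that match the whole header, so a contig whose name happened to contain "pvalue" would be kept.

```python
# Assumed header layout, taken from the file excerpt in this thread.
HEADER_FIELDS = ["name", "len", "score", "pvalue"]


def drop_header_lines(lines, header_fields=HEADER_FIELDS):
    """Yield only the lines whose fields differ from the header fields."""
    for line in lines:
        if line.split() != list(header_fields):
            yield line


# Illustrative input with a header at the top and another in the middle.
sample = [
    "name\tlen\tscore\tpvalue\n",
    "S10CNODE_1_length_374305_cov_118.066653\t374305\t0.4933076500892639\t0.06760329330009819\n",
    "name\tlen\tscore\tpvalue\n",
    "S10CNODE_2_length_331174_cov_150.761282\t331174\t0.5215792059898376\t0.05410151824155903\n",
]

cleaned = list(drop_header_lines(sample))
print(len(cleaned))
```

To apply it to files, open the input and output and call out.writelines(drop_header_lines(src)).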

Let me know if it works.

Best,
Joachim

I solved this problem by concatenating my dvf files using the code below, keeping only the first header:
awk '(NR == 1) || (FNR > 1)' {input.dvf} > {output.dvf}
input.dvf is a list of all the input files and output.dvf is the concatenated file with only a single header.
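
The same idea can be sketched in Python for cases where awk is not convenient. The function name concat_keep_first_header and the demonstration file names and rows are made up for illustration; the condition mirrors the awk expression, with total playing the role of NR and file_line_no the role of FNR.

```python
import os
import tempfile


def concat_keep_first_header(paths, out_path):
    """Concatenate files, keeping the first line of the first file only."""
    total = 0  # lines read across all files, like awk's NR
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                # file_line_no restarts at 1 for every file, like awk's FNR
                for file_line_no, line in enumerate(handle, start=1):
                    total += 1
                    if total == 1 or file_line_no > 1:
                        out.write(line)


# Small demonstration with two made-up DVF-style files.
workdir = tempfile.mkdtemp()
rows = ["contig_a\t1000\t0.91\t0.01\n", "contig_b\t2000\t0.12\t0.40\n"]
parts = []
for index, row in enumerate(rows):
    path = os.path.join(workdir, f"part{index}.txt")
    with open(path, "w") as handle:
        handle.write("name\tlen\tscore\tpvalue\n" + row)
    parts.append(path)

merged = os.path.join(workdir, "merged.txt")
concat_keep_first_header(parts, merged)
with open(merged) as handle:
    merged_lines = handle.read().splitlines()
print(merged_lines)
```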

Dear Joachim,

Thank you very much for your kind help.

I am sorry that I misunderstood Henry's suggestion.
I had only removed the header from the first line and didn't realize that the concatenated file still contained a header line from each of the other files.

So, I used the following command as you suggested and started my RF model run with the new file.

grep -v "pvalue" all.DVF.predictions.txt > all.DVF.predictions.NEW.txt

This time the RF model run finished successfully.

Parsing deepvirfinder
Parsing voghmm
Parsing micompletehmm
Loading Model and annotation table
Writing: 260460 bins to file

Thank you very much @enryH and @sklucas for your suggestions.