Parsing deepvirfinder line 512, in _parse_dvf_row contig_name, length, score, pvalue = line[:-1].split()
TomasaSbaffi opened this issue · 2 comments
Hello,
I am really happy to be trying the PHAMB pipeline on my data. I am running it on small coassemblies. I do not have a concatenated assembly, so I am running the pipeline separately for each coassembly. Is this the wrong approach?
When I run the RF model, I get the following error from Python:
Parsing deepvirfinder
Traceback (most recent call last):
...
File "path/to/phamb/workflows/mag_annotation/scripts/run_RF_modules.py", line 512, in _parse_dvf_row
contig_name, length, score, pvalue = line[:-1].split()
ValueError: too many values to unpack (expected 4)
The head of my clusters.tsv:
1 k141_169383 flag=1 multi=4.0000 len=2138
2 k141_566141 flag=1 multi=5.0000 len=1337
3 k141_562874 flag=1 multi=3.0000 len=2128
4 k141_174278 flag=1 multi=3.0000 len=1243
5 k141_155879 flag=1 multi=4.0000 len=1035
6 k141_981516 flag=0 multi=7.5058 len=1355
7 k141_615867 flag=1 multi=3.0000 len=1068
8 k141_749989 flag=1 multi=4.0000 len=1960
9 k141_945068 flag=0 multi=15.6210 len=2455
10 k141_1091919 flag=0 multi=5.9626 len=1318
The head of my all.DVF.predictions.txt:
name len score pvalue
k141_344865 flag=1 multi=4.0000 len=1127 1127 6.64381843762385e-07 0.8834881788654733
k141_620757 flag=0 multi=3.7828 len=1260 1260 0.061418987810611725 0.2213724601556009
k141_298883 flag=1 multi=3.0000 len=1290 1290 0.013160040602087975 0.3235138605634867
k141_390848 flag=1 multi=2.0790 len=1179 1179 0.6529936790466309 0.036823022886924996
k141_206919 flag=0 multi=10.9103 len=1479 1479 1.0 0.0
k141_505802 flag=1 multi=25.0000 len=1881 1881 0.08912927657365799 0.196616058614699
k141_1057576 flag=1 multi=3.0000 len=1049 1049 0.635226845741272 0.038635848629050534
k141_896644 flag=0 multi=200.6066 len=1872 1872 0.9405460357666016 0.01478585995921142
k141_1034585 flag=0 multi=3.0000 len=1245 1245 0.9999510645866394 0.0011518996903089357
Is it due to the 4 columns composing the name of the contigs? Any suggestions?
Thanks again for the great pipeline!
Thanks for trying out Phamb!
If you ran Vamb separately for each coassembly, it makes sense to run Phamb separately for each coassembly as well.
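Now to your problem: the error is caused by the naming of your contigs, specifically the spaces in the FASTA headers. The parser splits each line on whitespace and expects exactly four fields (name, length, score, p-value), so the extra space-separated parts of the contig name break the unpacking.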
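To illustrate, here is a minimal sketch that applies the same split as the line in the traceback to one row copied from the all.DVF.predictions.txt excerpt above. The variable names `row`, `fields`, and `message` are only for illustration and are not taken from run_RF_modules.py; the row is written with single spaces as it appears in this thread, and since `split()` with no arguments splits on any whitespace, tabs would behave the same way.

```python
# One data row from the all.DVF.predictions.txt excerpt in this thread.
row = "k141_620757 flag=0 multi=3.7828 len=1260 1260 0.061418987810611725 0.2213724601556009\n"

# Same operation as in the traceback: drop the trailing newline, split on whitespace.
fields = row[:-1].split()
print(len(fields))  # the contig name alone contributes four of these fields

# Unpacking into exactly four names fails because there are more than four fields.
message = None
try:
    contig_name, length, score, pvalue = fields
except ValueError as err:
    message = str(err)
    print(message)
```

The contig name accounts for four whitespace-separated fields on its own, so the row yields seven fields instead of four.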
I would recommend renaming your contigs and replacing the spaces with "_". This will make the parsing script work, and many other bioinformatics tools do not work properly with spaces in FASTA headers either.
The name change should look like this:
k141_1091919 flag=0 multi=5.9626 len=1318 -> k141_1091919_flag=0_multi=5.9626_len=1318
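One way to do the renaming is sketched below in plain Python. The function names `sanitize_header` and `rename_fasta` and the file names in the usage comment are hypothetical and are not part of Phamb. The sketch assumes a standard FASTA file where header lines start with ">", and it leaves sequence lines untouched.

```python
def sanitize_header(line: str) -> str:
    """Replace runs of whitespace in a FASTA header line with single underscores.

    Lines that are not headers (no leading '>') are returned unchanged.
    """
    if not line.startswith(">"):
        return line
    # split() drops the trailing newline and collapses repeated whitespace
    return ">" + "_".join(line[1:].split()) + "\n"


def rename_fasta(in_path: str, out_path: str) -> None:
    """Write a copy of the FASTA file at in_path with sanitized headers."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(sanitize_header(line))


# Hypothetical usage; the file names are placeholders:
# rename_fasta("contigs.fna", "contigs.renamed.fna")
```

The renamed file would presumably need to be the input for every downstream step (binning, DeepVirFinder, and so on), so that the contig names agree across clusters.tsv and the prediction files.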
Best,
Joachim
Thank you very very much!!