RasmussenLab/phamb

Parsing deepvirfinder line 512, in _parse_dvf_row contig_name, length, score, pvalue = line[:-1].split()

TomasaSbaffi opened this issue · 2 comments

Hello,

I am really happy to be trying the PHAMB pipeline on my data. I am running it on small coassemblies. I do not have a concatenated assembly, so I am running the pipeline separately for each coassembly. Is this the wrong approach?

When I run the RF model, I get the following error from Python:

Parsing deepvirfinder
Traceback (most recent call last):
  ...
  File "path/to/phamb/workflows/mag_annotation/scripts/run_RF_modules.py", line 512, in _parse_dvf_row
    contig_name, length, score, pvalue = line[:-1].split()
ValueError: too many values to unpack (expected 4)

The head of my clusters.tsv:

1	k141_169383 flag=1 multi=4.0000 len=2138
2	k141_566141 flag=1 multi=5.0000 len=1337
3	k141_562874 flag=1 multi=3.0000 len=2128
4	k141_174278 flag=1 multi=3.0000 len=1243
5	k141_155879 flag=1 multi=4.0000 len=1035
6	k141_981516 flag=0 multi=7.5058 len=1355
7	k141_615867 flag=1 multi=3.0000 len=1068
8	k141_749989 flag=1 multi=4.0000 len=1960
9	k141_945068 flag=0 multi=15.6210 len=2455
10	k141_1091919 flag=0 multi=5.9626 len=1318

The head of my all.DVF.predictions.txt:

name	len	score	pvalue
k141_344865 flag=1 multi=4.0000 len=1127	1127	6.64381843762385e-07	0.8834881788654733
k141_620757 flag=0 multi=3.7828 len=1260	1260	0.061418987810611725	0.2213724601556009
k141_298883 flag=1 multi=3.0000 len=1290	1290	0.013160040602087975	0.3235138605634867
k141_390848 flag=1 multi=2.0790 len=1179	1179	0.6529936790466309	0.036823022886924996
k141_206919 flag=0 multi=10.9103 len=1479	1479	1.0	0.0
k141_505802 flag=1 multi=25.0000 len=1881	1881	0.08912927657365799	0.196616058614699
k141_1057576 flag=1 multi=3.0000 len=1049	1049	0.635226845741272	0.038635848629050534
k141_896644 flag=0 multi=200.6066 len=1872	1872	0.9405460357666016	0.01478585995921142
k141_1034585 flag=0 multi=3.0000 len=1245	1245	0.9999510645866394	0.0011518996903089357

Is it because each contig name consists of four space-separated fields? Any suggestions?

Thanks again for the great pipeline!

Hi @TomasaSbaffi

Thanks for trying out Phamb!
If you ran Vamb separately for each coassembly, it makes sense to run Phamb separately for each coassembly as well.

Now to your problem: it is the naming of your contigs that produces the error, specifically the spaces in the FASTA headers.
I recommend renaming your contigs and replacing the spaces with "_", not only to make this parsing script work but also because many other bioinformatics tools do not handle spaces in FASTA headers properly.
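To illustrate why the spaces trigger the error, here is a minimal sketch that applies the same whitespace split shown in the traceback to one row copied from the all.DVF.predictions.txt excerpt above. It is only a reproduction of the failing statement, not the actual phamb code.

```python
# One row copied from the all.DVF.predictions.txt excerpt in this thread.
# The columns are tab-separated, but the contig name itself contains spaces.
line = "k141_344865 flag=1 multi=4.0000 len=1127\t1127\t6.64381843762385e-07\t0.8834881788654733\n"

# Same call as in the traceback: split() with no argument splits on ANY
# whitespace, so the spaces inside the name create extra fields.
fields = line[:-1].split()
print(len(fields))  # 7 fields instead of the expected 4

try:
    contig_name, length, score, pvalue = fields
except ValueError as err:
    print(err)  # too many values to unpack (expected 4)
```

With underscores in place of the spaces, the same split yields exactly four fields.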

The name change should look like this:
k141_1091919 flag=0 multi=5.9626 len=1318 -> k141_1091919_flag=0_multi=5.9626_len=1318
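A minimal sketch of the renaming, assuming a plain-text FASTA file where header lines start with ">". The function names `sanitize_header` and `rename_fasta` are hypothetical helpers for illustration and are not part of phamb.

```python
def sanitize_header(line: str) -> str:
    """Replace spaces with underscores in a FASTA header line.

    Sequence lines (anything not starting with '>') are returned unchanged.
    """
    if line.startswith(">"):
        # Strip trailing whitespace first so it does not become underscores.
        return line.rstrip().replace(" ", "_") + "\n"
    return line


def rename_fasta(in_path: str, out_path: str) -> None:
    """Copy a FASTA file, rewriting only the header lines."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(sanitize_header(line))


# Example with the header from this thread:
print(sanitize_header(">k141_1091919 flag=0 multi=5.9626 len=1318\n"), end="")
# >k141_1091919_flag=0_multi=5.9626_len=1318
```

Calling `rename_fasta("contigs.fna", "contigs.renamed.fna")` (placeholder file names) would write the cleaned copy. The renamed contigs presumably need to be used consistently in every upstream step, so that clusters.tsv and the DeepVirFinder predictions carry the same names.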

Best,
Joachim

Thank you very very much!!