medvedevgroup/TwoPaCo

Corrupt input when using graphdump on TwoPaCo output

Closed this issue · 5 comments

rob-p commented

Hi guys,

First of all; awesome work on TwoPaCo --- the method and the software are both fantastically useful, and I'm really excited to start using it for some downstream applications we're working on. I'm running into the following issue. I used TwoPaCo to build the compacted dBG for the human transcriptome (the following command):

twopaco  -k 31 -t 8 -f 32 gencode.v25.pc_transcripts.fa

As the name suggests, the reference is protein coding human transcripts from gencode v25. This seems to work fine, and I get the following output from TwoPaCo:

Threads = 8
Vertex length = 31
Hash functions = 5
Filter size = 4294967296
Capacity = 2
Files:
/mnt/scratch6/avi/data/txptome/gencode.v25.pc_transcripts.fa
--------------------------------------------------------------------------------
Round 0, 0:4294967296
Pass    Filling Filtering
1       9       16
2       3       1
True junctions count = 358144
False junctions count = 56685
Hash table size = 414829
Candidate marks count = 2551662
--------------------------------------------------------------------------------
Reallocating bifurcations time: 0
True marks count: 2540661
Edges construction time: 6
--------------------------------------------------------------------------------
Distinct junctions = 358144

Now, I want to convert this output to a GFA format (I tried both GFA1 and 2 and get the same error in each case). I used the following command:

graphdump -k 31 -s gencode.v25.pc_transcripts.fa -f gfa1 de_bruijn.bin > gencode.twopaco.gfa1

This results in the following error message:

error: The input is corrupted

At this point, some output has been generated, but I presume it's not complete because, despite the fact that there are ~96k input transcripts, I only get 35,451 output paths (i.e., P) entries in the resulting GFA file. Any idea what might be causing this issue or how to fix it?

Thanks!
Rob

Hi Rob, glad that you find TwoPaCo useful. I will take a look shortly and let you know.

Are there any sequences shorther than 31 in the input file?

rob-p commented

Hi Ilyan,

There are 11 such sequences in this file. I understand that, clearly, they cannot be represented in the 31-mer dBG. Do I need to remove them before feeding the file to TwoPaCo? I'll check to see if this resolves the graphdump issue.

Thanks,
Rob

rob-p commented

@IlyaMinkin,

Indeed, this resolves the issue, and graphdump runs to completion now without error. Thanks for pointing this out. I should have checked for such sequences before processing the file. Might I make the suggestion that TwoPaCo emit a warning when it encounters such records, to notify the user that they might cause issues?

Thanks!
Rob

@rob-p Not handling those sequences properly is clearly a bug on TwoPaCo's side. Thank you for reporting it. I am not sure about the proper program's behaviour in that case. Probably, ignoring those and showing a warning to the user, as you suggested.