Thoughts about compressing unitigs?
rchikhi opened this issue · 2 comments
Hi Sebastian, Agnieszka, Heng,
AGC looks great. I wanted to see if it would also work on badly assembled sequences, e.g. unitigs, and didn't get good compression ratios. Would you say the approach fundamentally wouldn't work for unitigs, or did I miss some parameter tweaks?
I tried to compress unitigs from 2 human samples (NA06986 & NA06991) using CHM13v2 as the reference. The resulting AGC file is 3.6 GB, which is more than the two raw gzipped unitig files combined (2 x 1.7 GB). Cmdline:
\time ~/tools/agc/agc create -t 10 chm13v2.0.oneline.fa NA06986.unitigs.fa.gz NA06991.unitigs.fa.gz > NA06986_NA06991.agc
Testing with parameter -s 200 didn't substantially change the results.
thanks in advance for any feedback,
Rayan
Hi Rayan,
AGC was designed for high-quality assemblies. Nevertheless, I'm a bit surprised that you report such poor ratios, so we'll have to take a look at this case. We should definitely be better than gzip. :-) I'll let you know when we have any news.
Best,
Sebastian
AGC does look great! And perhaps I misunderstood, but I think the size difference is due to the AGC file including three genomes (i.e., ref + 2 unitig assemblies), not just two. So AGC would still effectively be smaller at 3.6GB than concatenating the three assemblies.
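To make that comparison concrete, here is a rough back-of-the-envelope sketch. The 3.6 GB and 2 x 1.7 GB figures are the ones reported above, rounded and expressed in MB. The gzipped size of the CHM13v2 reference is not given in this thread, so `ref_gz_mb` is a placeholder argument, and the function name `agc_vs_gzip` is made up for illustration. With these numbers, AGC only comes out ahead if the gzipped reference is larger than about 0.2 GB.

```python
def agc_vs_gzip(agc_mb, sample_gz_mb, ref_gz_mb):
    """Compare one AGC archive against gzipped files, counting the reference.

    Returns (gzip_total_mb, saving_mb); saving_mb > 0 means AGC is smaller.
    """
    gzip_total_mb = sum(sample_gz_mb) + ref_gz_mb
    return gzip_total_mb, gzip_total_mb - agc_mb


# Sizes reported in this thread, rounded, in MB.
AGC_MB = 3600
SAMPLES_GZ_MB = [1700, 1700]

# Gzipped reference size at which both approaches take the same space.
break_even_mb = AGC_MB - sum(SAMPLES_GZ_MB)

if __name__ == "__main__":
    print("break-even reference size (MB):", break_even_mb)
    # ref_gz_mb=900 is an arbitrary placeholder, not a measured value.
    print(agc_vs_gzip(AGC_MB, SAMPLES_GZ_MB, ref_gz_mb=900))
```

Replacing the placeholder with the measured size of the gzipped reference would show how much, if anything, AGC actually saves here.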