HaploKit/vechat

Two ideas on optimization


For large eukaryotic genomes, the file overlap.paf can become very large. I think VeChat could be optimized in two ways to deal with this:

  1. Instead of writing overlap.paf, it could write overlap.paf.gz. This can be achieved by compressing the output of fpa with " | gzip -1 >". Racon accepts gzipped overlap files as input.
  2. It's probably worth adding a parameter that sets the minimum overlap length. If the read N50 is, for example, 20 kbp, the minimum overlap can be safely raised from the default 500 bp to, for example, 5000 bp. This would not only reduce the size of the PAF file, but would probably also speed up error correction, because short overlaps would no longer be considered.
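
To illustrate both ideas together, here is a minimal Python sketch that drops short overlaps from PAF records and writes the rest with fast gzip compression. It is not how VeChat does it: the function names (`overlap_length`, `filter_paf`, `write_gzipped`) are made up for this example, and I'm assuming "overlap length" means the shorter of the query and target spans (PAF columns 3-4 and 8-9).

```python
import gzip
from typing import Iterable, Iterator


def overlap_length(paf_line: str) -> int:
    """Return the shorter of the query and target spans of one PAF record."""
    fields = paf_line.rstrip("\n").split("\t")
    # PAF columns (1-based): 3 = query start, 4 = query end,
    # 8 = target start, 9 = target end.
    query_span = int(fields[3]) - int(fields[2])
    target_span = int(fields[8]) - int(fields[7])
    return min(query_span, target_span)


def filter_paf(lines: Iterable[str], min_overlap: int) -> Iterator[str]:
    """Yield only the records whose overlap is at least min_overlap bp."""
    for line in lines:
        if line.strip() and overlap_length(line) >= min_overlap:
            yield line


def write_gzipped(lines: Iterable[str], path: str) -> None:
    """Write records to a gzip file; level 1 mirrors 'gzip -1' (fast)."""
    with gzip.open(path, "wt", compresslevel=1) as handle:
        for line in lines:
            handle.write(line if line.endswith("\n") else line + "\n")
```

For example, `write_gzipped(filter_paf(open("overlap.paf"), 5000), "overlap.paf.gz")` would keep only overlaps of 5000 bp or more. In a real pipeline the filtering would more naturally happen in the stream between the overlapper and Racon, so the uncompressed file never has to be written.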

Hello @HaploKit! VeChat is a great tool that really helps to resolve some complex cases, but it is expensive in both computation and disk space. Are there any plans to implement gzipped overlap files?