karel-brinda/Phylign

Add LQ genomes to the dataset

leoisl opened this issue · 5 comments

Add LQ genomes to the dataset
  1. step – upload it to Zenodo (this will be highly useful for reproducibility; it will likely have to be split into 2-3 Zenodo datasets due to the maximum size limit)

  2. step – add a param to the pipeline for using the hq_lq version
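
A minimal sketch of what such a param could look like, assuming a dictionary-style pipeline config. The key `dataset_variant` and the labels `hq` / `hq_lq` are hypothetical placeholders, since the naming of the two versions is still to be decided, and this is not taken from the actual pipeline code.

```python
# Hypothetical variant labels; the final naming is still undecided.
ALLOWED_VARIANTS = ("hq", "hq_lq")


def get_dataset_variant(config, default="hq"):
    """Return the dataset variant requested in the config, after validating it.

    `config` is assumed to be a plain dict; `dataset_variant` is a
    hypothetical key used here only for illustration.
    """
    variant = config.get("dataset_variant", default)
    if variant not in ALLOWED_VARIANTS:
        raise ValueError(
            f"unknown dataset_variant {variant!r}; expected one of {ALLOWED_VARIANTS}"
        )
    return variant
```

Defaulting to the HQ-only variant would keep the behaviour unchanged for anyone who does not set the param.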

We should also settle on consistent names for these two versions of the 661k dataset.

I was wondering if we could have dustbins for LQ genomes. The reasoning is that every batch has just a handful of LQ genomes, and their inclusion in the batch has two negative impacts: 1. unnecessarily increasing the size of the HQ-genome Bloom filters; 2. frequently breaking the runs of 0s, making the compression less effective. Therefore, I think phylogenetic compression actually gets worse when LQ genomes are mixed with HQ ones: we add just a handful of genomes to every batch and make things worse for all of them. With respect to usability, I think we always want to query the HQ genomes. Sometimes we want to query the full dataset, including LQ genomes, and if we had separate LQ dustbin batches, we would just need to download a few additional batches and run the search. Therefore, I see only advantages in keeping separate LQ dustbin batches.
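
To illustrate the point about broken runs of 0s, here is a toy sketch with made-up data. It treats one k-mer as a row of presence/absence bits across the genomes of a batch and counts maximal runs of equal bits, which is only a rough proxy for compressibility and not how the real index is stored or compressed. The row contents and the `is_lq` flags are assumptions invented for the example.

```python
from itertools import groupby


def count_runs(bits):
    """Count maximal runs of identical values in a bit sequence."""
    return sum(1 for _ in groupby(bits))


# Toy batch in phylogenetic order: (presence bit of one k-mer, is_lq).
# Made-up data: two LQ genomes carry the k-mer inside a clade that lacks it.
mixed_batch = [
    (1, False), (1, False), (1, False),
    (0, False), (0, False), (0, False),
    (1, True),
    (0, False), (0, False), (0, False),
    (1, True),
    (0, False), (0, False), (0, False),
]

# LQ genomes kept inside the batch.
mixed_row = [bit for bit, _ in mixed_batch]

# LQ genomes moved to a separate dustbin batch.
hq_row = [bit for bit, is_lq in mixed_batch if not is_lq]
lq_row = [bit for bit, is_lq in mixed_batch if is_lq]

runs_mixed = count_runs(mixed_row)
runs_split = count_runs(hq_row) + count_runs(lq_row)

print(f"runs with LQ mixed in: {runs_mixed}")
print(f"runs with LQ dustbin:  {runs_split}")
```

In this toy row, the two LQ genomes split one long run of 0s into three, so the mixed row has 6 runs while the HQ row plus the dustbin row have 3 in total.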

I was wondering if we could have dustbins for LQ genomes.

For the future, yes. For now, for simplicity, we should keep exactly the same batches everywhere; otherwise it will be a nightmare to debug. The LQ-inclusive version will be used only for testing; it is not intended for end users (ideally).

Thanks a lot!