Quick rejection of entire batches
leoisl opened this issue · 5 comments
Some queries will contain sequences belonging to just a single species, or to a lineage of a single species. So, in principle, we should query only the batches of that species. We can do this by computing a union BF (Bloom filter) from all samples in each batch. We can afford a very-low-FP union BF for each batch with low disk/RAM usage (e.g. a 10-MB union BF for each batch, amounting to 3050 MB for all batches).
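To make the "union BF" idea concrete, here is a minimal sketch, assuming all per-sample filters in a batch share the same size and hash functions, in which case the union is just a bitwise OR. `ToyBloomFilter`, `union` and `kmers` are hypothetical names for illustration only; this is not the COBS index format or API.

```python
import hashlib
from typing import Iterable, Iterator


class ToyBloomFilter:
    """Toy Bloom filter (illustration only, not the COBS on-disk format)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, kmer: str) -> Iterator[int]:
        # One salted hash per hash function; each maps the k-mer to a bit position.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, kmer: str) -> None:
        for pos in self._positions(kmer):
            self.bits |= 1 << pos

    def __contains__(self, kmer: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(kmer))


def union(filters: Iterable[ToyBloomFilter]) -> ToyBloomFilter:
    """OR together filters that share the same size and hash functions."""
    filters = list(filters)
    if not filters:
        raise ValueError("need at least one filter")
    first = filters[0]
    merged = ToyBloomFilter(first.num_bits, first.num_hashes)
    for bf in filters:
        if (bf.num_bits, bf.num_hashes) != (first.num_bits, first.num_hashes):
            raise ValueError("filters must share size and hash count")
        merged.bits |= bf.bits
    return merged


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield all overlapping k-mers of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))
```

A union built this way never produces false negatives for k-mers present in any sample of the batch, but its false-positive rate grows with the number of set bits, which is why a fairly large per-batch filter is suggested above.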
The pipeline would thus be:
- Match all queries against all COBS union BFs (1 COBS job);
- For each batch, produce a query file containing queries that map to that batch;
- Resume the pipeline normally.
As such, we only run COBS search on a given batch for the queries that showed some evidence they might be in that batch. This might not work well on dustbins, but it should work fine on all the rest.
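The routing step of the proposed pipeline could look roughly like the sketch below. It assumes each batch has some membership structure for its union of k-mers (plain Python sets stand in for the union BFs here) and that a query is forwarded to a batch when at least a fraction `threshold` of its k-mers hit that structure. `route_queries`, `kmers`, `k` and `threshold` are hypothetical names and parameters, not part of the existing pipeline or of COBS.

```python
from typing import Container, Dict, List


def kmers(seq: str, k: int) -> List[str]:
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def route_queries(
    queries: Dict[str, str],
    batch_filters: Dict[str, Container[str]],
    k: int,
    threshold: float,
) -> Dict[str, List[str]]:
    """Map each batch name to the query names worth searching in it."""
    routed: Dict[str, List[str]] = {batch: [] for batch in batch_filters}
    for name, seq in queries.items():
        query_kmers = kmers(seq, k)
        if not query_kmers:
            continue  # sequence shorter than k: nothing to test
        for batch, union_filter in batch_filters.items():
            hits = sum(1 for km in query_kmers if km in union_filter)
            if hits / len(query_kmers) >= threshold:
                routed[batch].append(name)
    return routed


# Usage with exact sets standing in for per-batch union Bloom filters.
batch_filters = {
    "batch_a": {"ACG", "CGT", "GTA", "TAC"},
    "batch_b": {"TTT"},
}
queries = {"q1": "ACGTAC", "q2": "TTTT"}
routed = route_queries(queries, batch_filters, k=3, threshold=0.5)
```

Batches whose list stays empty can be skipped entirely; each remaining batch gets a query file containing only its routed queries.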
Hi @leoisl, this is something we should definitely think about in the future. Thanks for the suggestion!
It might actually also require computing an intersection; together, the union and the intersection could give us rigorous lower and upper estimates.
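One possible reading of this, sketched with exact k-mer sets rather than Bloom filters (so false positives are ignored): k-mers in the batch intersection are shared by every sample, and k-mers outside the batch union are in none, so the number of query k-mers matching any single sample lies between the two counts. `match_bounds` is a hypothetical helper written for this illustration.

```python
from typing import List, Set, Tuple


def match_bounds(query_kmers: Set[str], sample_kmer_sets: List[Set[str]]) -> Tuple[int, int]:
    """Bound the number of query k-mers shared with any one sample in a batch."""
    batch_union = set().union(*sample_kmer_sets)
    batch_intersection = set.intersection(*sample_kmer_sets)
    lower = len(query_kmers & batch_intersection)  # present in every sample
    upper = len(query_kmers & batch_union)         # present in at least one sample
    return lower, upper


samples = [{"A", "B", "C"}, {"B", "C", "D"}]
query = {"B", "D", "E"}
lower, upper = match_bounds(query, samples)
```

If `upper` falls below the k-mer threshold, the whole batch can be rejected; if `lower` already reaches it, every sample in the batch matches.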
I am actually thinking of implementing this soon, as I might have a use case where the query is just a handful of sequences from a specific species and has to be completed in seconds or a few minutes...
Of interest, we just map plasmids to 25 target batches (see #52 (comment)):
target_batches 25
But it seems we match to all batches except 7, so this might not help much with the plasmid search (unless we increase the COBS k-mer threshold)...
(base) leandro@leandro-OptiPlex-7060:~/Downloads/mof-search-fix-191/intermediate/01_match$ zgrep -c -v "^*" *.gz | awk 'BEGIN{FS=":"}{if ($2==0){print $0}}'
burkholderia_ubonensis__01____all_ebi_plasmids.gz:0
cutibacterium_acnes__01____all_ebi_plasmids.gz:0
dichelobacter_nodosus__01____all_ebi_plasmids.gz:0
prochlorococcus_marinus__01____all_ebi_plasmids.gz:0
taylorella_equigenitalis__01____all_ebi_plasmids.gz:0
treponema_pallidum__01____all_ebi_plasmids.gz:0
wolbachia_endosymbiont_of_drosophila_melanogaster__01____all_ebi_plasmids.gz:0
I'm putting this into the IDEAS-FOR-THE-FUTURE category.