arpcard/rgi

deprecate -split_prodigal_jobs

agmcarthur opened this issue · 3 comments

Analysis of simulated validation data by the CARD team (unpublished) revealed that Prodigal undercalls ORFs when the -split_prodigal_jobs option is used. This was particularly noticeable in an Acinetobacter baumannii genomic context. As this leads to false negatives, please deprecate support for -split_prodigal_jobs in RGI.

Looking at the implementation, this is likely a result of not generating a single training file for the whole genome and using it for the subjob ORF calling. Each split therefore tunes the ORF-finding model on only the sequence subset it receives, which lowers accuracy.
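
For illustration, a rough sketch of what "train once, then call ORFs per split" could look like. This is not RGI's actual code: the function names, file names, and chunking are hypothetical. It assumes Prodigal's `-t` behaviour in its normal (single genome) mode, where a missing training file is written from the input and an existing one is read and used.

```python
import subprocess
from pathlib import Path


def training_command(genome_fasta, training_file):
    """Build a Prodigal command that trains on the whole genome.

    Assumption: with -t pointing at a file that does not exist yet,
    Prodigal writes the training file from the input sequence.
    """
    return ["prodigal", "-q", "-i", str(genome_fasta), "-t", str(training_file)]


def calling_command(chunk_fasta, training_file, proteins_out, coords_out):
    """Build a Prodigal command that calls ORFs on one split using the
    shared training file instead of retraining on the split itself."""
    return [
        "prodigal", "-q",
        "-i", str(chunk_fasta),
        "-t", str(training_file),   # existing file, so it is read, not written
        "-a", str(proteins_out),    # protein translations
        "-o", str(coords_out),      # gene coordinates
    ]


def call_orfs_with_shared_training(genome_fasta, chunk_fastas, workdir):
    """Hypothetical driver: train once on the full genome, then run each split."""
    workdir = Path(workdir)
    training_file = workdir / "genome.trn"
    # Remove any stale file so the first run trains rather than reads.
    if training_file.exists():
        training_file.unlink()
    subprocess.run(training_command(genome_fasta, training_file), check=True)

    outputs = []
    for chunk in chunk_fastas:
        stem = Path(chunk).stem
        proteins = workdir / f"{stem}.faa"
        coords = workdir / f"{stem}.gbk"
        subprocess.run(
            calling_command(chunk, training_file, proteins, coords), check=True
        )
        outputs.append(proteins)
    return outputs
```

The per-split commands can still be run in parallel, since they only read the shared training file.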

A similar issue can occur when running on a large set of related genomes. Low-quality genomes will have even worse ORF calling because the model trained on them will be poorer. Training on all genomes and then using that training file would maximise accuracy and consistency (as would moving to a ggcaller approach!)

Issue is stale and will be closed in 7 days unless there is new activity

Re-opening to assess whether we should handle training better or deprecate the option.