Improving data quality of taxonomic assignments in large-scale public databases
To run the data quality analysis at large scale, we used BoaG for the computationally expensive part. We then used existing Python libraries and Jupyter notebooks for the postprocessing.
BoaG is a domain-specific language and infrastructure built on top of Hadoop for genomics data. Website: https://boalang.github.io/bio/
The BoaG compiler is written in Java, and the source code is available here
- This video gives step-by-step instructions for setting up the programming environment in Eclipse for the Boa compiler. link
- Lineage
- Provenance
- Construct Tree with ETE3 library
- Identifying conflicts
- Output: List of misclassified sequences
- This file lists the conflicts. Each line shows the sequence ID, the cluster ID, the sequence's current assignment, and the top 3 assignments in the cluster,
followed by the confidence score (CS) for the proposed assignment. See the example on the next line:
1A0Q 55656088 [('10090', 1)] [('562', 24), ('168807', 2), ('405955', 1)] CS= 0.8888888888888888
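The conflict line format described above can be read back into Python for postprocessing. The following is a minimal sketch, not the repository's own parser: the function names `parse_conflict_line` and `top_share` are hypothetical, and the regular expression assumes every line follows the layout of the single example shown. In that example the reported CS equals the top cluster count divided by the sum of the listed counts (24 / 27); whether the pipeline computes CS this way in general is an assumption.

```python
import ast
import re

# Assumed layout, inferred from the one example line in this README:
# <seq_id> <cluster_id> <sequence assignment list> <top-3 cluster list> CS= <score>
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+(\[.*?\])\s+(\[.*\])\s+CS=\s*([0-9.]+)$")


def parse_conflict_line(line):
    """Split one conflict line into its fields (hypothetical helper)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unexpected conflict line format: %r" % line)
    seq_id, cluster_id, seq_raw, top_raw, cs_raw = match.groups()
    return {
        "seq_id": seq_id,
        "cluster_id": cluster_id,
        # Lists of (taxonomy ID, count) tuples, written as Python literals.
        "seq_assignment": ast.literal_eval(seq_raw),
        "cluster_top": ast.literal_eval(top_raw),
        "cs": float(cs_raw),
    }


def top_share(cluster_top):
    """Fraction of the listed counts held by the most frequent assignment.

    Assumption: this matches the reported CS only because it does so in the
    README example; the real scoring rule may differ.
    """
    counts = [count for _, count in cluster_top]
    return max(counts) / sum(counts)


EXAMPLE = ("1A0Q 55656088 [('10090', 1)] "
           "[('562', 24), ('168807', 2), ('405955', 1)] CS= 0.8888888888888888")

record = parse_conflict_line(EXAMPLE)
print(record["seq_id"], record["cluster_top"][0], top_share(record["cluster_top"]))
```

Here the sequence 1A0Q is assigned taxonomy ID 10090, while 24 of the 27 listed cluster members are assigned 562, which is why 562 is the proposed assignment.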
The following works address detecting and correcting misclassifications in rRNA sequences:
- https://github.com/amkozlov/mislabels16-data
- https://peerj.com/articles/5030/#supplemental-information
- The entire dataset (119 million sequences): https://www.uniprot.org/uniref/?query=&fil=identity:0.9
The following are examples of misclassifications in the 90% clusters:
root conflict in UniRef90_I3TC36
cellular organisms conflict UniRef90_I3TC36
superkingdom conflict UniRef90_I3TC36
phylum conflict UniRef90_I3TC36
Take a 1M sample and check for common1 and common2:
python ~/Documents/MyGithub/docs/nr_functions/seq_clstr_conflict.py /Users/hbagheri/Downloads/nr_protein_functions/95-part-r-00000clustr-seq ../boag-job82-output.txt_converted nr_single_taxa_converted_1M > log_conf_1M