Improving data quality of taxonomic assignments in large-scale public databases

Data Quality of NR database

Method

Implementation

To run the data quality analysis at large scale, we used BoaG for the computationally expensive part. We then used existing Python libraries and Jupyter notebooks for the postprocessing.

BoaG is a domain-specific language and infrastructure built on top of Hadoop for genomics data. Website: https://boalang.github.io/bio/

The BoaG compiler is written in Java, and its source code is available here

  • This video gives step-by-step instructions for setting up the programming environment in Eclipse for the Boa compiler: link

Step 1: Script and analysis on BoaG infrastructure

Step 2: Postprocessing in Python and Jupyter Notebooks
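As a rough illustration of what the postprocessing step might start from, the sketch below groups job output by key. It assumes the BoaG job writes one `name[key] = value` record per line, as Boa jobs typically do; the actual output format of this project's jobs is not shown here. The function name `group_by_key`, the cluster IDs, and the taxonomy IDs are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed record shape (not confirmed by this repository): name[key] = value
RECORD = re.compile(r"^(?P<name>\w+)\[(?P<key>[^\]]*)\]\s*=\s*(?P<value>.*)$")


def group_by_key(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect the values emitted for each key, skipping lines that do not parse."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        match = RECORD.match(line.strip())
        if match:
            grouped[match.group("key")].append(match.group("value"))
    return dict(grouped)


# Hypothetical records: a cluster ID as the key, a taxonomy ID as the value.
sample = [
    "out[cluster_1] = 9606",
    "out[cluster_1] = 562",
    "out[cluster_2] = 9606",
    "not a record",
]
clusters = group_by_key(sample)
```

In a notebook, `lines` could just as well be an open file handle, so the output is read one line at a time rather than loaded whole.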

Dataset

Clustering information

Evaluation

Simulated dataset

Manual Analysis

Literature dataset

The following works address detecting and correcting misclassifications in rRNA sequences.

UniProt: UniRef90 (clusters at 90% sequence identity)

The following are examples of misclassifications in the 90% clusters:

root conflict in UniRef90_I3TC36
cellular organisms conflict in UniRef90_I3TC36
superkingdom conflict in UniRef90_I3TC36
phylum conflict in UniRef90_I3TC36
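One plausible reading of these labels is that a conflict is named after the level at which the members of a cluster stop agreeing. The sketch below finds the deepest taxon shared by all members of a cluster. It assumes each member has a lineage listed from the root downward; `deepest_common_node` and the lineages are hypothetical and are not taken from the project's scripts.

```python
from typing import List, Optional


def deepest_common_node(lineages: List[List[str]]) -> Optional[str]:
    """Return the deepest taxon shared by every lineage (lists run root-first)."""
    common = None
    # zip stops at the shortest lineage, so only shared depths are compared.
    for nodes in zip(*lineages):
        if len(set(nodes)) != 1:
            break  # members disagree from this depth on
        common = nodes[0]
    return common


# Hypothetical cluster whose members fall into different superkingdoms.
cluster_a = [
    ["root", "cellular organisms", "Bacteria", "Proteobacteria"],
    ["root", "cellular organisms", "Eukaryota", "Chordata"],
]
# Hypothetical cluster mixing a cellular organism with a virus.
cluster_b = [
    ["root", "cellular organisms", "Bacteria"],
    ["root", "Viruses"],
]

shared_a = deepest_common_node(cluster_a)
shared_b = deepest_common_node(cluster_b)
```

Under this reading, `cluster_a` agrees only down to "cellular organisms" and `cluster_b` only at "root", so the shallower the shared node, the more severe the conflict.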

Run time

Take a 1M sample and check for common1 and common2:

python ~/Documents/MyGithub/docs/nr_functions/seq_clstr_conflict.py /Users/hbagheri/Downloads/nr_protein_functions/95-part-r-00000clustr-seq ../boag-job82-output.txt_converted nr_single_taxa_converted_1M > log_conf_1M
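How the 1M sample was drawn is not stated. For a file too large to load into memory, one common option is reservoir sampling, which picks a fixed number of lines uniformly at random in a single pass. The sketch below shows that technique as an assumption about one way to build such a sample, not the method used for this run; `reservoir_sample` is a hypothetical helper.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(lines: Iterable[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                sample[j] = line
    return sample


# Small demonstration on hypothetical records; a real run would pass a file handle.
records = [f"seq_{n}" for n in range(1000)]
picked = reservoir_sample(records, k=10, seed=42)
```

Passing a fixed `seed` makes the sample reproducible, which helps when the same subset has to be checked again later.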