xgtax_logo

Cross-reference of Genomic Taxonomy

About

xgTaxonomy is a new method for metagenomic classification that uses data compression algorithms, known as compressors, to classify genomic sequences. Our two-step evaluation process shows that this approach outperforms existing methods in terms of accuracy and reliability. Additionally, combining features from multiple compressors improves classification accuracy by 26.22%. This method offers a promising strategy for improving the accuracy and reliability of metagenomic classification and provides insights into the statistical and algorithmic nature of genomic data.
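To make the idea of compressor-derived features concrete, here is a minimal sketch that turns a sequence into one compression ratio per compressor. It is only an illustration: the standard-library compressors (zlib, bz2, lzma) are stand-ins chosen for this example, and the function name `compression_features` is hypothetical. The compressors and features actually used by xgTaxonomy are the ones installed and computed by the project's own scripts.

```python
import bz2
import lzma
import zlib

# Standard-library compressors used as stand-ins (an assumption for this
# sketch); xgTaxonomy installs its own set via install_compressors.sh.
COMPRESSORS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lambda data: lzma.compress(data),
}


def compression_features(sequence: str) -> dict:
    """Return compressed size divided by original size, per compressor."""
    raw = sequence.encode("ascii")
    if not raw:
        raise ValueError("sequence must not be empty")
    return {name: len(fn(raw)) / len(raw) for name, fn in COMPRESSORS.items()}


if __name__ == "__main__":
    repetitive = "ACGT" * 500  # highly redundant toy sequence
    print(compression_features(repetitive))
```

Each sequence then becomes a small numeric vector, which is the kind of input a standard classifier can be trained on.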

Team

  • Jorge M. Silva (1)
  • João R. Almeida (1, 2)
  1. DETI/IEETA, LASI, University of Aveiro, Aveiro, Portugal
  2. University of A Coruña, A Coruña, Spain

Getting Started

Prerequisites

  • Git
  • Docker and Docker-compose (if using the Docker option)

Download Project

Get xgTaxonomy project using:

git clone https://github.com/bioinformatics-ua/xgTaxonomy.git
cd xgTaxonomy/

Using Docker

Docker and Docker Compose must be installed on the system (see https://docs.docker.com/engine/install/ubuntu/).

Then, follow these instructions:

git clone https://github.com/bioinformatics-ua/xgTaxonomy.git
cd xgTaxonomy
docker-compose build
docker-compose up -d && docker exec -it xgTaxonomy bash && docker-compose down

Install Compressors

Install the compressors used in the benchmark:

bash install_compressors.sh;

Result Replication

To run the pipeline and generate all the reports in the reports folder, use the following commands.

Download sequences I

To obtain random sequences for the baseline performance test, run:

cd src/
python3 getSampleSequences.py 

Baseline test

To run the baseline compression test:

cd src/
python3 compress_baseline.py

Download sequences II

To obtain random sequences for taxonomic classification, run:

cd src/
python3 getDatabaseSequences.py 

Classifiers

F1-score and accuracy for each compressor

cd src/
python3 classifier.py -b > ../results/f1score_accuracy_single.txt
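For readers unfamiliar with the two metrics reported above, the sketch below computes accuracy and a macro-averaged F1-score in plain Python. It is an illustration only: the macro averaging, the function names, and the toy labels are assumptions made for this example, and the averaging used by classifier.py is not stated here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores (macro averaging)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels, purely illustrative.
    truth = ["virus", "virus", "bacteria", "archaea"]
    predicted = ["virus", "bacteria", "bacteria", "archaea"]
    print(accuracy(truth, predicted), macro_f1(truth, predicted))
```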

Classification report for each compressor

cd src/
python3 classifier.py -cr > ../results/classification_reports_single.txt

Classification F1-score and accuracy for all genomic features

cd src/
python3 classifier.py -ag -b > ../results/f1_score_accuracy_all_genome_features.txt
python3 classifier.py -ag -cr > ../results/classification_report_all_genome_features.txt 

Classification F1-score and accuracy for all proteomic features

cd src/
python3 classifier.py -ap -b > ../results/f1_score_accuracy_all_proteome_features.txt
python3 classifier.py -ap -cr > ../results/classification_report_all_proteome_features.txt

Classification report using all compression features

cd src/
python3 classifier.py -cr -ac > ../results/classification_report_all_columns.txt

F1-score and accuracy using all compression features

cd src/
python3 classifier.py -ac -b > ../results/f1score_accuracy_all_columns.txt

Feature selection for F1-score and accuracy

cd src/
python3 classifier.py -fs -ac -b > ../results/feature_selection.txt

Classification F1-score and accuracy for all possible feature combinations (brute force)

cd src/
python3 classifier.py -bf -b > ../results/f1score_accuracy_all_combinations.txt
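A brute-force search over feature combinations can be sketched as follows: enumerate every non-empty subset of the feature names and keep the one with the highest score. This is a generic illustration, not the code behind the -bf flag. The helper names, the feature names in the demo, and the `score` callback (which would normally train and evaluate a classifier on the chosen columns) are all hypothetical.

```python
from itertools import combinations


def all_feature_subsets(features):
    """Yield every non-empty subset of features, smallest subsets first."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


def best_subset(features, score):
    """Return (subset, score) for the highest-scoring subset.

    `score` is a caller-supplied function mapping a tuple of feature
    names to a number, for example a cross-validated F1-score.
    """
    best, best_score = None, float("-inf")
    for subset in all_feature_subsets(features):
        value = score(subset)
        if value > best_score:
            best, best_score = subset, value
    return best, best_score


if __name__ == "__main__":
    # Hypothetical feature names and a toy scoring rule for demonstration.
    names = ["compressor_a", "compressor_b", "compressor_c"]
    toy_score = lambda subset: len(subset) if "compressor_b" in subset else 0
    print(best_subset(names, toy_score))
```

With n features there are 2^n - 1 non-empty subsets, so the running time grows exponentially with the number of features.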

Classification report for all compressors (brute force)

cd src/
python3 classifier.py -bf -cr > ../results/classification_report_all_combinations.txt

Test Correlation

cd src/
python3 correlateTable.py
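What correlateTable.py computes is not described here. As a point of reference, and assuming a Pearson correlation between two numeric feature columns, a minimal sketch looks like this (the function name and the sample values are hypothetical):

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Illustrative values only, e.g. compression ratios from two compressors.
    ratios_a = [0.21, 0.25, 0.30, 0.42]
    ratios_b = [0.19, 0.26, 0.33, 0.40]
    print(pearson(ratios_a, ratios_b))
```

A value close to 1 or -1 would suggest that two features carry largely redundant information.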

Cite

Please cite the following if you use xgTaxonomy in your work:

in progress

Issues

Please let us know if there are any issues.

License

xgTaxonomy is released under the MIT License.