A program that implements the anomaly detection model ANDES of Kanjilal et al. (submitted) for uncovering aberrant genomic regions by modeling genomic autocovariation with functional data analysis.

ANDES

ANDES is a Python and R implementation of Kanjilal, Campelo dos Santos, Arnab, DeGiorgio and Assis' (2024) suite of methods that merge the power of unsupervised anomaly detection algorithms with feature extraction techniques to identify anomalous regions of the genome associated with various biological factors or technical artifacts.


Citing ANDES

Thank you for using ANDES.

If you use this Python and R package, then please cite: R Kanjilal, AL Campelo dos Santos, SP Arnab, M DeGiorgio, R Assis. In preparation.


Reporting issues

If you encounter any issues running ANDES, then please contact Ria Kanjilal directly at rkanjilal@fau.edu.


Getting started

Before you can use the ANDES Python and R package, ensure that the following Python packages are installed in your computational environment:

pip install pandas
pip install numpy
pip install scikit-learn
pip install scipy
pip install scikit-allel
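
As a quick check that the installation succeeded, the following minimal sketch reports which of the required Python packages cannot be found in the current environment. The helper name missing_modules is hypothetical and not part of ANDES. Note that two packages are imported under names that differ from their pip names: scikit-learn is imported as sklearn and scikit-allel as allel.

```python
import importlib.util

# Import names of the packages listed above. scikit-learn and scikit-allel
# are imported as "sklearn" and "allel", respectively.
REQUIRED_MODULES = ["pandas", "numpy", "sklearn", "scipy", "allel"]


def missing_modules(module_names):
    """Return the module names that cannot be located in this environment.

    Hypothetical helper for illustration; it is not shipped with ANDES.
    """
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing Python packages (import names):", ", ".join(missing))
    else:
        print("All required Python packages were found.")
```

An empty result means every package in the list can be imported.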

Additionally, you will need to have the following packages installed in R:

install.packages("dplyr")
install.packages("fda")
install.packages("MASS")
install.packages("ff")

The ANDES software package comes with Python and R scripts, which are run in sequence by a shell script, ANDES.sh. You can run the program with the following command:

./ANDES.sh VCF_file.vcf

You can provide multiple input files, e.g.:

./ANDES.sh VCF_file_A.vcf VCF_file_B.vcf

If ANDES.sh is not executable, grant it execute permission with the following command:

chmod +x ANDES.sh

Format of ANDES.sh file


Command #1:

python ./vcf_ss_M_features.py --vcf "$@"

This command calls a Python script for generating summary statistics and moment features from the .vcf files.

Input: --vcf receives all of the .vcf files passed to ANDES.sh (the shell expands "$@" to the full list of arguments).

The above operation will output two files: one containing the summary statistics generated from the input .vcf files, and the other containing the moment features extracted from the summary statistics file.

SS.csv
M_features.csv
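
To illustrate what "moment features" can look like, here is a minimal sketch that summarizes one series of summary-statistic values by its first four moments. This is an assumption for illustration only: this README does not state which moments vcf_ss_M_features.py computes or how M_features.csv is laid out, and the function name moment_features is hypothetical.

```python
import math


def moment_features(values):
    """Summarize a sequence of numbers by four moments.

    Illustrative only: population (divide-by-n) moments are used here,
    which may differ from the definitions used by ANDES.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n

    def central_moment(k):
        # k-th central moment with a divide-by-n normalization.
        return sum((v - mean) ** k for v in values) / n

    variance = central_moment(2)
    if variance == 0:
        raise ValueError("values are constant; skewness and kurtosis are undefined")
    skewness = central_moment(3) / math.pow(variance, 1.5)
    kurtosis = central_moment(4) / variance ** 2
    return {
        "mean": mean,
        "variance": variance,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


if __name__ == "__main__":
    # Made-up values standing in for one summary statistic across windows.
    print(moment_features([1.0, 2.0, 3.0, 4.0, 5.0]))
```

Applying such a function to each summary statistic of a genomic region would give one fixed-length feature vector per region, which is the form that the anomaly detection steps below operate on.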

Command #2:

Rscript ./ss_fda_features.R

This command calls an R script for generating FDA (Functional Data Analysis) features from the summary statistics file.

The above operation will output the FDA features in the following file:

fda_features.csv

Command #3:

Rscript ./MD_MF_scores.R

This command calls an R script for generating anomaly scores (in the form of p-values) using the MD (Mahalanobis Distance) method from the distinct sets of M (moments) and F (FDA) features.

The above operation will output the following files:

MD-M_anomalyscores.csv
MD-F_anomalyscores.csv
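
The idea of turning Mahalanobis distances into p-values can be sketched as follows. This is not the code in MD_MF_scores.R. It assumes, as is common, that the squared distance is compared against a chi-square distribution whose degrees of freedom equal the number of features, and it is restricted to two features so that the chi-square tail has the closed form exp(-d²/2) and no external libraries are needed. The function name is hypothetical.

```python
import math


def mahalanobis_pvalues_2d(points):
    """Return one p-value per 2-D point from its squared Mahalanobis distance.

    Assumption for illustration: p-values come from the chi-square
    distribution with 2 degrees of freedom, whose upper tail is exp(-x / 2).
    """
    n = len(points)
    if n < 3:
        raise ValueError("need at least three points")
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n

    # Entries of the 2x2 sample covariance matrix.
    s_xx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    s_yy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    s_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    det = s_xx * s_yy - s_xy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")

    pvalues = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # Squared Mahalanobis distance using the closed-form 2x2 inverse.
        d2 = (s_yy * dx * dx - 2 * s_xy * dx * dy + s_xx * dy * dy) / det
        pvalues.append(math.exp(-d2 / 2))
    return pvalues


if __name__ == "__main__":
    # Made-up feature vectors; the last one lies far from the others.
    data = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (5, 5)]
    for point, p in zip(data, mahalanobis_pvalues_2d(data)):
        print(point, round(p, 3))
```

Smaller p-values indicate observations that lie farther from the bulk of the data in feature space. With more than two features, the same comparison would use a general chi-square tail function.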

Command #4:

python ./IFSVM_training.py

This command calls a Python script that trains the IF (Isolation Forest) and SVM (One-class Support Vector Machine) algorithms on the distinct sets of M (moments) and F (FDA) features and generates anomaly scores.

The above operation will output the following files:

IFscores_M.csv
IFscores_F.csv
SVMscores_M.csv
SVMscores_F.csv

Command #5:

Rscript ./IFSVM_MF_anomalyscores.R

This command calls an R script for generating IF-M, IF-F, SVM-M, and SVM-F anomaly scores (in the form of p-values) from the scores generated by training and testing IF and SVM algorithms.

The above operation will output the following files:

IF-M_anomalyscores.csv
IF-F_anomalyscores.csv
SVM-M_anomalyscores.csv
SVM-F_anomalyscores.csv
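
For readers who prefer to drive the pipeline from Python rather than from the shell, the five commands above can be sketched as a small driver. The function names build_pipeline and run_pipeline are hypothetical and not part of ANDES; the commands themselves are copied from the list above, and the sketch assumes it is run from the directory that contains the scripts.

```python
import subprocess


def build_pipeline(vcf_files):
    """Return the five ANDES commands, in order, as argument lists."""
    if not vcf_files:
        raise ValueError("at least one .vcf file is required")
    return [
        # Command #1: summary statistics and moment features from the VCFs.
        ["python", "./vcf_ss_M_features.py", "--vcf", *vcf_files],
        # Command #2: FDA features from the summary statistics.
        ["Rscript", "./ss_fda_features.R"],
        # Command #3: Mahalanobis-distance anomaly scores.
        ["Rscript", "./MD_MF_scores.R"],
        # Command #4: Isolation Forest and one-class SVM training.
        ["python", "./IFSVM_training.py"],
        # Command #5: p-values from the IF and SVM scores.
        ["Rscript", "./IFSVM_MF_anomalyscores.R"],
    ]


def run_pipeline(vcf_files):
    """Run each command in turn, stopping at the first failure."""
    for argv in build_pipeline(vcf_files):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    for argv in build_pipeline(["CEU22.vcf"]):
        print(" ".join(argv))
```

Passing check=True makes the driver stop at the first failing step, so later steps do not run against missing or incomplete input files.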

Example application of ANDES

Within the terminal, move to the directory containing the ANDES Python and R scripts and the example file CEU22.vcf. Then run ANDES with the following command:

./ANDES.sh CEU22.vcf
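
After the run finishes, one way to confirm that every step completed is to check for the output files named in the sections above. The sketch below assumes that the scripts write their outputs to the current working directory, which this README does not state explicitly; the helper name missing_outputs is hypothetical.

```python
from pathlib import Path

# Output file names listed for Commands #1 to #5.
EXPECTED_OUTPUTS = [
    "SS.csv",
    "M_features.csv",
    "fda_features.csv",
    "MD-M_anomalyscores.csv",
    "MD-F_anomalyscores.csv",
    "IFscores_M.csv",
    "IFscores_F.csv",
    "SVMscores_M.csv",
    "SVMscores_F.csv",
    "IF-M_anomalyscores.csv",
    "IF-F_anomalyscores.csv",
    "SVM-M_anomalyscores.csv",
    "SVM-F_anomalyscores.csv",
]


def missing_outputs(directory="."):
    """Return the expected output files that are absent from `directory`."""
    base = Path(directory)
    return [name for name in EXPECTED_OUTPUTS if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing output files:")
        for name in missing:
            print("  " + name)
    else:
        print("All expected ANDES output files are present.")
```

Because the list follows the order of the commands, the first missing file indicates the step at which the pipeline stopped.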