ANDES is a Python and R implementation of Kanjilal, Campelo dos Santos, Arnab, DeGiorgio and Assis' (2024) suite of methods that merge the power of unsupervised anomaly detection algorithms with feature extraction techniques to identify anomalous regions of the genome associated with various biological factors or technical artifacts.
Thank you for using ANDES.
If you use this Python and R package, please cite: R Kanjilal, AL Campelo dos Santos, SP Arnab, M DeGiorgio, R Assis. In preparation.
If you encounter any issues running ANDES, please contact Ria Kanjilal directly at rkanjilal@fau.edu.
Before using ANDES, ensure that the following Python packages are installed in your computational environment:
pip install pandas
pip install numpy
pip install scikit-learn
pip install scipy
pip install scikit-allel
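To confirm that these packages are visible to the Python interpreter that will run ANDES, a quick check such as the following may help. It is a minimal sketch and not part of ANDES; the helper name missing_modules is hypothetical. Note that scikit-learn is imported as sklearn and scikit-allel as allel.

```python
import importlib.util

# Import names for the pip packages listed above
# (scikit-learn is imported as "sklearn", scikit-allel as "allel").
REQUIRED_MODULES = ["pandas", "numpy", "sklearn", "scipy", "allel"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required Python modules were found.")
```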
Additionally, you will need to have the following packages installed in R:
install.packages("dplyr")
install.packages("fda")
install.packages("MASS")
install.packages("ff")
ANDES consists of Python and R scripts that are run through a shell script, 'ANDES.sh'. You can run the program with the following command:
./ANDES.sh VCF_file.vcf
Multiple input files can be provided, e.g.:
./ANDES.sh VCF_file_A.vcf VCF_file_B.vcf
If execution permission is needed, run the following command:
chmod +x ANDES.sh
python ./vcf_ss_M_features.py --vcf "$@"
This command calls a Python script for generating summary statistics and moment features from the .vcf files.
Input: --vcf takes all of the .vcf files passed to ANDES.sh.
This step writes two files: one containing summary statistics computed from the input .vcf files, and the other containing moment features extracted from the summary statistics file.
SS.csv
M_features.csv
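The exact moments that ANDES extracts are defined in its scripts. Purely as an illustration of the idea, the sketch below computes the first four moments (mean, variance, skewness, kurtosis) of one summary statistic measured across several windows. The function name, the example values, and the choice of four moments are assumptions made for illustration and do not describe the contents of vcf_ss_M_features.py.

```python
import numpy as np


def moment_features(values):
    """Return mean, variance, skewness, and kurtosis of a 1-D sequence.

    Hypothetical helper for illustration; ANDES may use different moments.
    """
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    centered = x - mean
    variance = np.mean(centered ** 2)  # population (biased) variance
    if variance == 0:
        # Constant input: standardized higher moments are undefined.
        return mean, 0.0, float("nan"), float("nan")
    skewness = np.mean(centered ** 3) / variance ** 1.5
    kurtosis = np.mean(centered ** 4) / variance ** 2  # non-excess kurtosis
    return mean, variance, skewness, kurtosis


# Made-up values of one summary statistic across five windows.
example = [1.0, 2.0, 3.0, 4.0, 5.0]
print(moment_features(example))
```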
Rscript ./ss_fda_features.R
This command calls an R script for generating FDA (Functional Data Analysis) features from the summary statistics file.
The above operation will output the FDA features in the following file:
fda_features.csv
Rscript ./MD_MF_scores.R
This command calls an R script that generates anomaly scores (in the form of p-values) using the MD (Mahalanobis distance) method, applied separately to the M (moment) and F (FDA) feature sets.
The above operation will output the following files:
MD-M_anomalyscores.csv
MD-F_anomalyscores.csv
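For readers unfamiliar with Mahalanobis-distance scoring, the following Python sketch shows one common way to turn distances into p-values. It assumes that squared distances approximately follow a chi-square distribution with as many degrees of freedom as there are features, which holds under multivariate normality. That assumption, the function name, and the simulated data are illustrative only; MD_MF_scores.R may compute its p-values differently.

```python
import numpy as np
from scipy.stats import chi2


def mahalanobis_pvalues(features):
    """Return squared Mahalanobis distances and chi-square upper-tail p-values.

    features: array of shape (n_regions, n_features).
    Hypothetical helper; not taken from the ANDES scripts.
    """
    X = np.asarray(features, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # The pseudo-inverse guards against a singular covariance matrix.
    cov_inv = np.linalg.pinv(np.atleast_2d(cov))
    # Row-wise quadratic form: d2[i] = centered[i] @ cov_inv @ centered[i]
    d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    pvalues = chi2.sf(d2, df=X.shape[1])
    return d2, pvalues


# Simulated feature matrix (not real ANDES output): 200 typical regions
# plus one region placed far from the rest.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 3)), [[8.0, 8.0, 8.0]]])
d2, p = mahalanobis_pvalues(X)
print("Row with the smallest p-value:", int(np.argmin(p)))
```

Small p-values mark rows that lie far from the bulk of the data relative to its covariance structure.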
python ./IFSVM_training.py
This command calls a Python script to train IF (Isolation Forest) and SVM (One-class Support Vector Machine) algorithms on the distinct sets of M (moments) and F (FDA) features and generate anomaly scores.
The above operation will output the following files:
IFscores_M.csv
IFscores_F.csv
SVMscores_M.csv
SVMscores_F.csv
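As a rough picture of what training these two detectors involves, the sketch below fits scikit-learn's IsolationForest and OneClassSVM to a simulated feature matrix and collects raw scores. The hyperparameters, the standardization step, and the simulated data are assumptions chosen for illustration; they are not taken from IFSVM_training.py.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Simulated feature matrix standing in for the M or F features:
# 200 typical rows plus one row placed far from the rest.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 3)), [[8.0, 8.0, 8.0]]])
X_scaled = StandardScaler().fit_transform(X)

# Isolation Forest: lower score_samples values indicate more anomalous rows.
iforest = IsolationForest(n_estimators=100, random_state=0).fit(X_scaled)
if_scores = iforest.score_samples(X_scaled)

# One-class SVM: lower decision_function values indicate more anomalous rows.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_scaled)
svm_scores = ocsvm.decision_function(X_scaled)

print("Row with the lowest IF score:", int(np.argmin(if_scores)))
print("Row with the lowest SVM score:", int(np.argmin(svm_scores)))
```

Both detectors are unsupervised, so no labels are needed; the raw scores are on different scales and are not directly comparable between the two methods.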
Rscript ./IFSVM_MF_anomalyscores.R
This command calls an R script for generating IF-M, IF-F, SVM-M, and SVM-F anomaly scores (in the form of p-values) from the scores generated by training and testing IF and SVM algorithms.
The above operation will output the following files:
IF-M_anomalyscores.csv
IF-F_anomalyscores.csv
SVM-M_anomalyscores.csv
SVM-F_anomalyscores.csv
In a terminal, change to the directory containing the ANDES Python and R scripts and the example file CEU22.vcf. Then run ANDES with the following command:
./ANDES.sh CEU22.vcf
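After the run finishes, it can be useful to confirm that every output file named in this README was produced. The following sketch does this; it assumes that ANDES writes its outputs to the current working directory, and the helper name missing_outputs is hypothetical.

```python
from pathlib import Path

# Output file names listed in this README.
EXPECTED_OUTPUTS = [
    "SS.csv",
    "M_features.csv",
    "fda_features.csv",
    "MD-M_anomalyscores.csv",
    "MD-F_anomalyscores.csv",
    "IFscores_M.csv",
    "IFscores_F.csv",
    "SVMscores_M.csv",
    "SVMscores_F.csv",
    "IF-M_anomalyscores.csv",
    "IF-F_anomalyscores.csv",
    "SVM-M_anomalyscores.csv",
    "SVM-F_anomalyscores.csv",
]


def missing_outputs(directory="."):
    """Return the expected output files that are absent from `directory`."""
    base = Path(directory)
    return [name for name in EXPECTED_OUTPUTS if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing output files:", ", ".join(missing))
    else:
        print("All expected output files are present.")
```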