
Alignment-free method to identify and analyse discriminant genomic subsequences within pathogen sequences

Primary language: Python. License: MIT.

CASTOR-KRFE

  • CASTOR-KRFE v1.3 Help file
  • K-mer-based feature identifier for viral genomic classification
  • Copyright (C) 2023 Dylan Lebatteux, Amine M. Remita, Abdoulaye Banire Diallo
  • Authors : Dylan Lebatteux, Amine M. Remita
  • Contact : lebatteux.dylan@courrier.uqam.ca

Description

CASTOR-KRFE is an alignment-free method that identifies a set of k-mers able to discriminate between groups of genomic sequences. Its core is Recursive Feature Elimination with Support Vector Machines (SVM-RFE), a machine learning feature selection method. CASTOR-KRFE identifies an optimal k-mer length k that maximizes classification performance while minimizing the number of features. The extracted set of k-mers can be used to build a prediction model, which can then be used to classify new genomic sequences. A newer module identifies variations of the discriminative k-mers and reports their associated information for each sequence class.
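
To make the notion of k-mer features concrete, here is a minimal sketch of how sequences could be turned into k-mer count vectors before any feature selection. It is an illustration only, not the CASTOR-KRFE implementation: the function names `count_kmers` and `kmer_matrix` are hypothetical, and the SVM-RFE step itself is not shown.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer of length k in a nucleotide sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_matrix(sequences, k):
    """Return (vocabulary, matrix): one row per sequence, one column per observed k-mer."""
    counts = [count_kmers(s, k) for s in sequences]
    # Sorted union of all k-mers seen in any sequence (hypothetical column order).
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[kmer] for kmer in vocabulary] for c in counts]
    return vocabulary, matrix


# Two toy sequences, k = 2
vocabulary, matrix = kmer_matrix(["ACGTAC", "ACGGGT"], k=2)
print(vocabulary)  # ['AC', 'CG', 'GG', 'GT', 'TA']
print(matrix)      # [[2, 1, 0, 1, 1], [1, 1, 2, 1, 0]]
```

A matrix of this kind, together with the class label of each sequence, is the sort of input a feature selection method such as SVM-RFE operates on.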

Required software

Parameters

List of parameters requiring adjustment in the configuration_file.ini :

  • k_min : Minimum length of k-mers
  • k_max : Maximum length of k-mers
  • T : Percentage performance threshold (T = 0.99 is recommended)
  • training_fasta : Training fasta file path
  • testing_fasta : Testing fasta file path
  • reference_sequence : Path of the reference sequence in GenBank format
  • k_mers_path : Path of the extracted k-mers file
  • model_path : Path of the prediction model file
  • prediction_path : Path of the sequence prediction file
  • evaluation_mode : Evaluation mode during the prediction (True/False)
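
As an illustration of how these parameters fit together, the sketch below parses an example configuration with Python's standard `configparser` module. The section name `[parameters]`, the values of k_min and k_max, and the input file names are assumptions made for this example; check the configuration_file.ini shipped with the repository for the actual layout.

```python
import configparser

# Hypothetical configuration content; the section name and values are assumptions.
EXAMPLE_INI = """
[parameters]
k_min = 5
k_max = 10
T = 0.99
training_fasta = training.fasta
testing_fasta = testing.fasta
reference_sequence = reference.gb
k_mers_path = k_mers.fasta
model_path = model.pkl
prediction_path = Prediction.csv
evaluation_mode = True
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)
params = config["parameters"]

# Typed accessors convert the raw strings stored in the ini file.
k_min = params.getint("k_min")
k_max = params.getint("k_max")
threshold = params.getfloat("T")
evaluation_mode = params.getboolean("evaluation_mode")

# Basic sanity checks before launching a run.
if k_min > k_max:
    raise ValueError("k_min must not exceed k_max")
if not 0.0 < threshold <= 1.0:
    raise ValueError("T is expected to be a fraction in (0, 1]")

print(k_min, k_max, threshold, evaluation_mode)
```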

Usage

  1. Specify the parameters of the previous section in the configuration_file.ini.
  2. Run the following command:
$ python main.py configuration_file.ini
  3. Select an option:
  • 1) Extract k-mers | Required parameters: T, k_min, k_max, training_fasta and k_mers_path
  • 2) Fit a model | Required parameters: training_fasta, k_mers_path and model_path
  • 3) Predict sequences | Required parameters: testing_fasta, k_mers_path, model_path, prediction_path and evaluation_mode
  • 4) Motif analyzer | Required parameters: training_fasta, k_mers_path and reference_sequence
  • 5) Exit/Quit

Fasta file format example for n sequences:

>id_sequence_1|target_sequence_1
CTCAACTCAGTTCCACCAGGCTCTGTTGGATCCGAGGGTAAGGGCTCTGTATTTTCCTGC
>id_sequence_2|target_sequence_2
CTCAACTCAGTTCCACCAGGCTCTGTTGGATCCGAGGGTAAGGGCTCTGTATTTTCCTGC
...
...
...
>id_sequence_n-1|target_sequence_n-1
CTCAACTCAGTTCCACCAGGCTCTGTTGGATCCGAGGGTAAGGGCTCTGTATTTTCCTGC
>id_sequence_n|target_sequence_n
CTCAACTCAGTTCCACCAGGCTCTGTTGGATCCGAGGGTAAGGGCTCTGTATTTTCCTGC
  • The character "|" is used to separate the sequence ID from its target.
  • The target must be specified in the fasta file for a prediction with evaluation_mode = True.
  • For more detailed examples, see the data sets in the Data folder.
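
The header convention above can be checked with a few lines of Python. The following is a minimal sketch, assuming only the format described in this section; `parse_fasta` is a hypothetical helper and not part of CASTOR-KRFE. It returns the ID, the target (or `None` when no "|" is present) and the sequence of each record.

```python
def parse_fasta(text):
    """Parse FASTA text whose headers look like '>id|target'.

    Returns a list of (identifier, target, sequence) tuples; target is None
    when the header has no '|' separator.
    """
    records = []
    identifier, target, chunks = None, None, []
    for line in text.splitlines():
        line = line.strip()  # also drops trailing tabs or spaces
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                records.append((identifier, target, "".join(chunks)))
            identifier, _, target = line[1:].partition("|")
            target = target or None
            chunks = []
        elif identifier is not None:
            chunks.append(line)  # sequences may span several lines
    if identifier is not None:
        records.append((identifier, target, "".join(chunks)))
    return records


example = """>id_sequence_1|target_sequence_1
CTCAACTCAGTTCCACCAGG
CTCTGTTGGATCC
>id_sequence_2
GAGGGTAAGGGC
"""

records = parse_fasta(example)
for identifier, target, sequence in records:
    print(identifier, target, len(sequence))
```

A record whose target is `None`, like the second one here, could still be predicted, but it would not be usable with evaluation_mode = True, since the target is required in that mode.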

Output

  • k_mers.fasta : File listing the extracted k-mers
  • model.pkl : Prediction model generated by CASTOR-KRFE
  • Prediction.csv : Results file of the prediction of unknown genomic sequences
  • Signature_location.xlsx : Analysis report associated with a signature

Reference to cite CASTOR-KRFE

Reference to cite KANALYZER (Option 4: Motif analyzer)