DeLUCS

This repository contains all the source files required to run DeLUCS (https://doi.org/10.1101/2021.05.13.444008), a deep learning clustering tool for DNA sequences, as well as a detailed guide for running the code.

Computational Pipeline:

1. Build the dataset:

  python build_dp.py --data_path=<PATH_sequence_folder>
  • Input: Folders with the sequences in FASTA format.
  • Output: File in the form (label, sequence, accession).
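
To make the (label, sequence, accession) layout concrete, here is a minimal sketch of how such a file could be produced. It is not the code in `build_dp.py`: the helper names `read_fasta`, `build_dataset` and `save_dataset` are hypothetical, and it assumes one subfolder per label, the first token of each FASTA header as the accession, and a pickle file as the output container.

```python
import pickle
from pathlib import Path


def read_fasta(path):
    """Yield (accession, sequence) pairs from one FASTA file."""
    accession, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if accession is not None:
                    yield accession, "".join(chunks)
                # Assumption: the accession is the first token of the header.
                tokens = line[1:].split()
                accession, chunks = (tokens[0] if tokens else ""), []
            elif accession is not None:
                chunks.append(line.upper())
    if accession is not None:
        yield accession, "".join(chunks)


def build_dataset(root):
    """Collect (label, sequence, accession) tuples.

    Assumption: each subfolder of `root` holds the FASTA files of one
    label, and the folder name is used as that label.
    """
    records = []
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue
        for fasta in sorted(folder.iterdir()):
            if not fasta.is_file():
                continue
            for accession, sequence in read_fasta(fasta):
                records.append((folder.name, sequence, accession))
    return records


def save_dataset(records, out_path):
    """Write the list of tuples to a pickle file (assumed container)."""
    with open(out_path, "wb") as handle:
        pickle.dump(records, handle)
```

Calling `save_dataset(build_dataset("<PATH_sequence_folder>"), "dataset.p")` would then write one tuple per sequence; the output file name here is only an example.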

2. Compute the mimic sequences.

  python get_pairs.py --data_path=<PATH_pickle_dataset> --k=6 --modify='mutation' --output=<PATH_output_file> --n_mimics=<n mimics per sequence>
  • Input: Pickle file in the form (label, sequence, accession).
  • Output: File in the form (pairs, x_test, y_test).
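
The flags `--k=6`, `--modify='mutation'` and `--n_mimics` suggest that each sequence is paired with mutated copies of itself and compared through k-mer statistics. The sketch below illustrates that idea only; it is not the code in `get_pairs.py`. The function names, the substitution-only mutation model and the default `rate` are assumptions made for illustration.

```python
import itertools
import random


def kmer_counts(sequence, k=6):
    """Count overlapping k-mers over the ACGT alphabet.

    Windows containing any other symbol (for example N) are skipped.
    """
    index = {
        "".join(p): i
        for i, p in enumerate(itertools.product("ACGT", repeat=k))
    }
    counts = [0] * len(index)
    for start in range(len(sequence) - k + 1):
        slot = index.get(sequence[start:start + k])
        if slot is not None:
            counts[slot] += 1
    return counts


def mutate(sequence, rate, rng):
    """Substitute each ACGT base with probability `rate`.

    Assumption: only point substitutions are modelled here.
    """
    out = []
    for base in sequence:
        if base in "ACGT" and rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_pairs(sequence, n_mimics, k=6, rate=1e-4, seed=0):
    """Pair the original k-mer vector with `n_mimics` mutated versions.

    The default `rate` is a placeholder, not a value from the source.
    """
    rng = random.Random(seed)
    original = kmer_counts(sequence, k)
    return [
        (original, kmer_counts(mutate(sequence, rate, rng), k))
        for _ in range(n_mimics)
    ]
```

With `k=6` each vector has 4^6 = 4096 entries, one per possible 6-mer.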

3. Train the model.

  • For training DeLUCS to cluster your own data (No ground truth available):

     python TrainDeLUCS.py --n_clusters=<number of clusters> --data_dir=<PATH_of_computed_mimics> --out_dir=<OUTPUTDIR>

    • Input: Pickle file with the mimics (the path always ends with /testing_data.p).
    • Output: Pickle file with the cluster assignments for each sequence.
  • For training DeLUCS and testing its performance with your own data (labels must be available):

     python EvaluateDeLUCS.py --data_dir=<PATH_of_computed_mimics> --out_dir=<OUTPUTDIR>

    • Input: Pickle file with the mimics in the form (pairs, x_test, y_test).
    • Output: Confusion matrix.
  • For training a single Neural Network in an unsupervised way:

     python SingleRun.py --n_clusters=<number of clusters> --data_dir=<PATH_of_computed_mimics> --out_dir=<OUTPUTDIR>
    
  • For testing the performance of a single neural network trained in an unsupervised way (labels must be available):

     python EvaluateSingleRun.py --data_dir=<PATH_of_computed_mimics> --out_dir=<OUTPUTDIR>
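
When labels are available, cluster assignments can be compared with them through a confusion matrix, as the evaluation scripts above report. The following is a minimal, dependency-free sketch of that comparison and is not taken from the repository. The function names are hypothetical, and the summary score shown is cluster purity, which is not necessarily the metric the evaluation scripts compute.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Return (labels, clusters, matrix).

    Rows follow the sorted true labels, columns the sorted cluster ids.
    """
    labels = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    counts = Counter(zip(y_true, y_pred))
    matrix = [
        [counts[(label, cluster)] for cluster in clusters]
        for label in labels
    ]
    return labels, clusters, matrix


def purity(matrix):
    """Fraction of sequences that fall in the majority label of their cluster."""
    total = sum(sum(row) for row in matrix)
    best_per_cluster = [max(column) for column in zip(*matrix)]
    return sum(best_per_cluster) / total


# Toy example: cluster ids are arbitrary and need not match label names.
y_true = ["a", "a", "b", "b", "b"]
y_pred = [1, 1, 0, 0, 1]
labels, clusters, matrix = confusion_matrix(y_true, y_pred)
print(matrix)          # [[0, 2], [2, 1]]
print(purity(matrix))  # 0.8
```

Because cluster ids carry no inherent meaning, the matrix is read column by column: a column dominated by a single row indicates a cluster that corresponds closely to one label.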