DeLUCS

This repository contains all the source files required to run DeLUCS (https://doi.org/10.1101/2021.05.13.444008), a deep learning clustering tool for DNA sequences, as well as a detailed guide for running the code.

Computational Pipeline:

1. Build the dataset.

	python build_dp.py --data_path=<PATH_sequence_folder>

- Input: Folders with the sequences in FASTA format.
- Output: Pickle file in the form (label, sequence, accession).
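
The step above can be pictured with a minimal sketch. This is not `build_dp.py` itself. It assumes one subfolder per label and takes the first token of each FASTA header as the accession; both conventions, and the helper names `read_fasta` and `build_dataset`, are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def read_fasta(path):
    """Yield (accession, sequence) for each record in a FASTA file."""
    accession, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if accession is not None:
                    yield accession, "".join(chunks)
                # Assumption: the accession is the first token of the header.
                tokens = line[1:].split()
                accession, chunks = (tokens[0] if tokens else ""), []
            elif accession is not None:
                chunks.append(line.upper())
    if accession is not None:
        yield accession, "".join(chunks)


def build_dataset(root):
    """Collect (label, sequence, accession) tuples.

    Assumption: each subfolder of `root` holds the FASTA files of one label.
    """
    records = []
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue
        for fasta in sorted(folder.iterdir()):
            if fasta.is_file():
                for accession, sequence in read_fasta(fasta):
                    records.append((folder.name, sequence, accession))
    return records


# Tiny demonstration on a throwaway folder layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "classA").mkdir()
    (Path(tmp) / "classA" / "a.fasta").write_text(">ACC1 demo\nACGT\nacgt\n")
    demo_records = build_dataset(tmp)
    demo_blob = pickle.dumps(demo_records)  # the real script writes a pickle file
```

The actual on-disk layout of the pickle written by `build_dp.py` may differ; the sketch only illustrates the (label, sequence, accession) shape described above.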

2. Compute the mimic sequences.

	python get_pairs.py --data_path=<PATH_pickle_dataset> --k=6 --modify='mutation' --output=<PATH_output_file> --n_mimics=<n mimics per sequence>

- Input: Pickle file in the form (label, sequence, accession).
- Output: Pickle file in the form (pairs, x_test, y_test).
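
To give a rough idea of what a mimic is, here is a sketch under stated assumptions. It is not the code in `get_pairs.py`. It assumes a mimic is a copy of the sequence with random point substitutions, that each sequence is represented by its k-mer count vector, and that a pair is (original, mimic). The names `mutate`, `kmer_counts`, `make_pairs` and the `rate` parameter are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def mutate(sequence, rate, rng):
    """Return a copy of `sequence` with random point substitutions."""
    bases = "ACGT"
    out = []
    for base in sequence:
        if base in bases and rng.random() < rate:
            # Replace the base with one of the three other bases.
            out.append(rng.choice([b for b in bases if b != base]))
        else:
            out.append(base)
    return "".join(out)


def kmer_counts(sequence, k):
    """Count k-mers over ACGT, returned in lexicographic k-mer order.

    k-mers containing other symbols (for example N) are ignored.
    """
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


def make_pairs(sequence, k, n_mimics, rate=0.01, seed=0):
    """Build `n_mimics` (original, mimic) pairs of k-mer count vectors."""
    rng = random.Random(seed)
    original = kmer_counts(sequence, k)
    return [
        (original, kmer_counts(mutate(sequence, rate, rng), k))
        for _ in range(n_mimics)
    ]


demo_sequence = "ACGT" * 50
demo_pairs = make_pairs(demo_sequence, k=6, n_mimics=3)
```

With `k=6`, as in the command above, each vector has 4^6 = 4096 entries.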

3. Train the model.

For training DeLUCS to cluster your own data (No ground truth available):
```
python TrainDeLUCS.py --n_clusters=<number of clusters> --data_dir=<PATH_of_computed_mimics> --out_dir=<OUTPUT_DIR>
```
- Input: Pickle file with the mimics (the path always ends with /testing_data.p).
- Output: Pickle file with the cluster assignments for each sequence.

For training DeLUCS and testing its performance on your own data (labels must be available):
```
python EvaluateDeLUCS.py --data_dir=<PATH_of_computed_mimics> --out_dir=<OUTPUT_DIR>
```
- Input: Pickle file with the mimics in the form of (pairs, x_test, y_test).
- Output: Confusion matrix.
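
Because cluster indices are arbitrary, a confusion matrix for a clustering is usually read after matching clusters to labels. The sketch below illustrates that idea in plain Python. It is not taken from `EvaluateDeLUCS.py`; the function names are hypothetical, and the brute-force search over permutations is only practical for a handful of clusters.

```python
from itertools import permutations


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true labels, columns are predicted cluster indices."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def best_mapping_accuracy(matrix):
    """Accuracy under the best one-to-one matching of clusters to labels.

    Brute force over all permutations: fine for a few clusters only.
    """
    n = len(matrix)
    total = sum(map(sum, matrix))
    best = max(
        sum(matrix[t][perm[t]] for t in range(n))
        for perm in permutations(range(n))
    )
    return best / total


# Toy labels and cluster assignments, made up for illustration.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [2, 2, 2, 0, 0, 1, 1, 0]
demo_matrix = confusion_matrix(y_true, y_pred, 3)
demo_accuracy = best_mapping_accuracy(demo_matrix)
```

In this toy case the best matching sends label 0 to cluster 2, label 1 to cluster 0 and label 2 to cluster 1, so seven of the eight sequences count as correct.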

For training a single Neural Network in an unsupervised way:

	python SingleRun.py --n_clusters=<number of clusters> --data_dir=<PATH_of_computed_mimics> --out_dir=<OUTPUT_DIR>

For testing the performance of a single Neural Network trained in an unsupervised way (labels must be available):
```
python EvaluateSingleRun.py --data_dir=<PATH_of_computed_mimics> --out_dir=<OUTPUT_DIR>
```
