Deep learning and SVM models for kmer-based enhancer classification. Utility programs for data processing are also provided. Usage of each program (excluding wrapper scripts) can be displayed by running its command without any options.
Dependencies: python3, g++, perl, tensorflow, sklearn, numpy, pandas
- mask_seq.pl Mask regions of chromosomal sequences: repeat sequences are replaced with 'N's, and other masked regions with 'X's.
- paral_mask_seq_*.pl Wrapper to run mask_seq.pl.
- divide_200bp.pl Divide enhancer chromosomal positions into 200bp windows.
- retrieve_sequence.pl Retrieve DNA sequences for given chromosomal positions.
- filter_seq.pl Filter out fasta sequences overlapping any masked nucleotides, and generate GC content information for the kept sequences (see the GC-content sketch after this list).
- scan_sel_new.cc Randomly select n-fold (n times as many) DNA sequences from the genome, matching the sequence lengths, repeat contents and GC contents given in a feature file.
- paral_scan_sel_new_*.pl Wrapper to run scan_sel_new.
- get_kmer_dict.pl Generate kmer dictionaries mapping kmers to fold-changes by comparing positive and control sequences.
- paral_get_kmer_dict.pl Wrapper to run get_kmer_dict.pl.
- make_fasta_cv.py Make cross-validation datasets for a pair of positive and negative fasta files. Note that headers and their corresponding sequences are placed on the same line in the output files (see the cross-validation sketch after this list).
- code_seq.pl Transform DNA sequences (flattened fasta files) into kmer fold changes for deep learning, based on the provided kmer dictionaries (see the k-mer encoding sketch after this list).
- paral_code_seq.pl Wrapper to run code_seq.pl.
- random_selection.py Randomly select a pre-defined proportion of samples from feature files.
- SeqEnhMLP.py Train and test a multi-layer perceptron (MLP) enhancer classifier.
- SeqEnhCNN.py Train and test a CNN enhancer classifier.
- SeqEnhRNN.py Train and test an RNN enhancer classifier.
- multi_ml_enhancer.py Train and test an enhancer classifier using a conventional machine learning model.
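
As a rough illustration of the filtering step, the GC-content sketch below rejects any sequence overlapping a masked base ('N' or 'X', as produced by mask_seq.pl) and reports GC content for the kept sequences. It is a minimal sketch of the idea only and does not reproduce filter_seq.pl's actual input and output formats; the window names and sequences are toy examples.

```python
# Minimal sketch of the filtering idea behind filter_seq.pl (not its actual I/O):
# drop any sequence containing masked bases and report GC content for the rest.

def gc_content(seq):
    """Fraction of G/C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def keep_sequence(seq):
    """Reject sequences overlapping masked nucleotides ('N' from repeats, 'X' from other masks)."""
    return not any(base in "NX" for base in seq.upper())

if __name__ == "__main__":
    windows = {"win1": "ACGTGGCCAT", "win2": "ACGTNNGCAT"}  # toy stand-ins for 200bp windows
    for name, seq in windows.items():
        if keep_sequence(seq):
            print(name, f"GC={gc_content(seq):.3f}")
```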
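
The flattened-fasta cross-validation output of make_fasta_cv.py can be pictured with the sketch below: each fasta record is flattened to a single header-plus-sequence line (a tab separator is assumed here) and records are distributed across folds. The input file names, fold count and output naming are illustrative assumptions, not the script's actual conventions.

```python
# Minimal sketch of a flattened-fasta cross-validation split
# (fold count and file naming are illustrative, not make_fasta_cv.py's conventions).

def read_fasta(path):
    """Yield (header, sequence) pairs from a standard multi-line fasta file."""
    header, parts = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(parts)
                header, parts = line, []
            elif line:
                parts.append(line)
        if header is not None:
            yield header, "".join(parts)

def write_folds(records, n_folds, prefix):
    """Write flattened fasta (header TAB sequence on one line) into n_folds files."""
    for i in range(n_folds):
        with open(f"{prefix}.fold{i}.txt", "w") as out:  # hypothetical output naming
            for j, (header, seq) in enumerate(records):
                if j % n_folds == i:
                    out.write(f"{header}\t{seq}\n")

if __name__ == "__main__":
    pos = list(read_fasta("positives.fa"))   # hypothetical input file names
    neg = list(read_fasta("negatives.fa"))
    write_folds(pos, 10, "pos")
    write_folds(neg, 10, "neg")
```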
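
The k-mer fold-change features built by get_kmer_dict.pl and code_seq.pl can be sketched as follows: each k-mer is assigned its frequency fold change between positive and control sequences, and each nucleotide of a window is then encoded by the fold changes of the k-mers (k = 5, 7, 9 and 11) at that position. The pseudocount, the anchoring of k-mers at each nucleotide, the default value near window edges and the dictionary format are assumptions made for illustration only.

```python
# Minimal sketch of the k-mer fold-change encoding idea behind get_kmer_dict.pl / code_seq.pl.
# Pseudocounts, k-mer anchoring and edge handling are assumptions, not the scripts' details.
from collections import Counter

def kmer_counts(seqs, k):
    """Count all k-mers over a list of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def fold_change_dict(pos_seqs, ctrl_seqs, k, pseudo=1.0):
    """Map each k-mer to its positive-vs-control frequency fold change (pseudocount assumed)."""
    pos, ctrl = kmer_counts(pos_seqs, k), kmer_counts(ctrl_seqs, k)
    pos_total = sum(pos.values()) or 1
    ctrl_total = sum(ctrl.values()) or 1
    return {kmer: ((count + pseudo) / pos_total) / ((ctrl.get(kmer, 0) + pseudo) / ctrl_total)
            for kmer, count in pos.items()}

def encode(seq, dicts, ks=(5, 7, 9, 11)):
    """Per nucleotide, one fold change per k; k-mers are assumed to start at each position."""
    row = []
    for i in range(len(seq)):
        for k in ks:
            kmer = seq[i:i + k]
            # near the window edge a full k-mer is unavailable; default to 1.0 (an assumption)
            row.append(dicts[k].get(kmer, 1.0) if len(kmer) == k else 1.0)
    return row

if __name__ == "__main__":
    pos_seqs = ["ACGTACGTACGT"]    # toy positive windows
    ctrl_seqs = ["TTTTACGTAAAA"]   # toy matched controls
    dicts = {k: fold_change_dict(pos_seqs, ctrl_seqs, k) for k in (5, 7, 9, 11)}
    print(len(encode(pos_seqs[0], dicts)))  # 12 nt x 4 k-mer sizes = 48 values
```

For a 200 nt window this encoding yields 200 x 4 values per sequence, matching the feature layout shown under "Format for features" below.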
Testing data can be downloaded from http://www.bdxconsult.com/SeqEnhDL.
Positive and negative features should be stored in the same files, while training and testing data should be kept in separate files. Class labels should be provided as dummy variables in separate files. Format for features:
5mer_fc_nt1 7mer_fc_nt1 9mer_fc_nt1 11mer_fc_nt1 ... 5mer_fc_nt200 7mer_fc_nt200 9mer_fc_nt200 11mer_fc_nt200
5mer_fc_nt1 7mer_fc_nt1 9mer_fc_nt1 11mer_fc_nt1 ... 5mer_fc_nt200 7mer_fc_nt200 9mer_fc_nt200 11mer_fc_nt200
Format for class labels:
0 1
1 0
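
Assuming the feature and label files are plain whitespace-delimited text with one sample per row, as the layout above suggests, they can be loaded along the following lines (file names are placeholders):

```python
# Load whitespace-delimited feature and dummy-label files (file names are placeholders).
import numpy as np

trnX = np.loadtxt("trnX.txt")  # shape: (n_train_samples, 800) for 200 nt x 4 k-mer sizes
trnY = np.loadtxt("trnY.txt")  # shape: (n_train_samples, 2), rows like "0 1" or "1 0"
tstX = np.loadtxt("tstX.txt")
tstY = np.loadtxt("tstY.txt")

assert trnX.shape[0] == trnY.shape[0] and tstX.shape[0] == tstY.shape[0]
print(trnX.shape, trnY.shape, tstX.shape, tstY.shape)
```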
To see how to run each enhancer classifier program, users can simply type "python3 program_name". Each program requires four input files: training features (trnX), training labels (trnY), testing features (tstX) and testing labels (tstY). Users need to make sure these files are provided in the correct order or with the right preceding arguments.
Accuracy (and AUC) will be printed after an enhancer classifier is successfully trained and tested.
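
The sketch below is not the implementation of SeqEnhMLP.py or the other classifiers; it is a minimal Keras MLP on the same four inputs, shown only to illustrate the expected workflow of training on trnX/trnY and reporting accuracy and AUC on tstX/tstY. Layer sizes, training epochs, the positive-class column and file names are arbitrary assumptions.

```python
# Minimal MLP sketch (not SeqEnhMLP.py itself): train on trnX/trnY, then report
# accuracy and AUC on tstX/tstY. Architecture and hyperparameters are arbitrary.
import numpy as np
from sklearn.metrics import roc_auc_score
from tensorflow.keras import layers, models

trnX, trnY = np.loadtxt("trnX.txt"), np.loadtxt("trnY.txt")  # placeholder file names
tstX, tstY = np.loadtxt("tstX.txt"), np.loadtxt("tstY.txt")

model = models.Sequential([
    layers.Input(shape=(trnX.shape[1],)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(2, activation="softmax"),  # two-class dummy-variable labels
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(trnX, trnY, epochs=20, batch_size=64, verbose=0)

probs = model.predict(tstX)
accuracy = float(np.mean(probs.argmax(axis=1) == tstY.argmax(axis=1)))
auc = roc_auc_score(tstY[:, 1], probs[:, 1])  # column 1 assumed to be the positive class
print(f"Accuracy: {accuracy:.3f}  AUC: {auc:.3f}")
```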