snRNA-snoRNA-Classification

What's in the dir 'Dataset'?

original datasets -- *.txt. Raw snRNA/snoRNA/tRNA/rRNA sequences for human and mouse.

final.pkl.gz -- Human snRNA/snoRNA dataset

final_m.pkl.gz -- Mouse snRNA/snoRNA dataset

final_tr.pkl.gz -- Human rRNA/tRNA dataset. Used to pollute the human snRNA/snoRNA dataset.

final_m_tr.pkl.gz -- Mouse rRNA/tRNA dataset. Used the same way as final_tr.pkl.gz, for the mouse data.

dirty.pkl.gz -- Human polluted dataset. snRNA/snoRNA sequences mixed with tRNA/rRNA sequences.

dirty_m.pkl.gz -- Mouse polluted dataset. snRNA/snoRNA sequences mixed with tRNA/rRNA sequences.
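The *.pkl.gz files are gzip-compressed pickles, so they can be opened with the Python standard library alone. The sketch below is a minimal example, not the loading code in dataset.py: the helper names `load_pkl_gz` and `save_pkl_gz` are hypothetical, and the `(sequences, labels)` layout of the toy object is an assumption, since this README does not describe what each pickle actually contains. If the files were written by Python 2, passing `encoding="latin1"` is usually needed. Only unpickle files you trust, because unpickling can execute arbitrary code.

```python
import gzip
import os
import pickle
import tempfile


def load_pkl_gz(path, encoding="ASCII"):
    """Load one gzip-compressed pickle file and return the stored object.

    `encoding` only matters for pickles written by Python 2; in that case
    "latin1" is the usual choice.
    """
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle, encoding=encoding)


def save_pkl_gz(obj, path):
    """Write `obj` to `path` as a gzip-compressed pickle."""
    with gzip.open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


# Round trip on a toy object. The (sequences, labels) layout is hypothetical
# and is NOT taken from the real dataset files.
toy = (["ACGU", "GGCA"], [0, 1])
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.pkl.gz")
    save_pkl_gz(toy, toy_path)
    restored = load_pkl_gz(toy_path)

print(restored == toy)
```

To inspect a real file, call something like `load_pkl_gz("Dataset/final.pkl.gz")` and look at `type(...)` of the result before assuming any structure.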

What's in the dir 'Code'?

saved models (maybe) -- xxx_model_xxx. CNN model parameters are saved in these directories.

dataset.py -- data operations. All data-related work is done in this script. Do not touch it unless you understand what it does; it may destroy the datasets in Dataset.

CNN.py, CNN_pred.py -- CNN model. Not used directly in the experiment, but all other CNN files are derived from them. (Writing a shared class was too much bother, so each CNN file is a copy of these with a few parts modified.)

CNN_cv.py, CNN_cv_pred.py -- CNN model with CV. k-fold cross-validation is applied during training, and the k resulting classifiers vote when making predictions. Use final.pkl.gz or final_m.pkl.gz when running these two scripts.
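The train-k-models-then-vote idea can be sketched without any CNN code. The example below is an illustration only, not an extract from CNN_cv.py or CNN_cv_pred.py: the function names `k_fold_indices` and `majority_vote` are hypothetical, the interleaved fold assignment is one of several reasonable choices, and it assumes each classifier outputs a hard class label (the real scripts might combine probabilities instead). Ties are broken in favor of the smallest label, which is also an assumption.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Return k (train_indices, validation_indices) pairs.

    Sample i goes to fold i % k, so every sample is used for validation
    exactly once and for training k - 1 times.
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for held_out in range(k):
        train = sorted(
            idx
            for fold_id, fold in enumerate(folds)
            if fold_id != held_out
            for idx in fold
        )
        splits.append((train, folds[held_out]))
    return splits


def majority_vote(per_model_predictions):
    """Combine k lists of predicted labels into one list by majority vote.

    Ties are resolved by picking the smallest label (an arbitrary choice).
    """
    voted = []
    for votes in zip(*per_model_predictions):
        counts = Counter(votes)
        best = max(counts.values())
        voted.append(min(label for label, n in counts.items() if n == best))
    return voted


# Toy run: 3 folds over 6 samples, then 3 classifiers voting on 4 test items.
splits = k_fold_indices(6, 3)
preds = [
    [0, 1, 1, 0],  # classifier trained without fold 0
    [0, 1, 0, 0],  # classifier trained without fold 1
    [1, 1, 0, 1],  # classifier trained without fold 2
]
final = majority_vote(preds)
print(splits[0])  # ([1, 2, 4, 5], [0, 3])
print(final)      # [0, 1, 0, 0]
```

Using an odd k avoids ties in the two-class case, which makes the tie-breaking rule irrelevant there.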

dirty_CNN_cv.py, dirty_CNN_cv_pred.py -- CNN model with dirty samples. Adds a new class, 'none-of-above', for the tRNA/rRNA sequences. Use dirty.pkl.gz to train and dirty_m.pkl.gz to test.
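As a rough picture of what a polluted dataset with a 'none-of-above' class looks like, the sketch below mixes contaminant sequences into a labeled set under a third label. It is not the procedure in dataset.py: the function name `pollute`, the label ids (0 for snRNA, 1 for snoRNA, 2 for 'none-of-above'), and the `(sequence, label)` pair layout are all assumptions made for illustration.

```python
import random

# Hypothetical label ids; the real scripts may encode classes differently.
SNRNA, SNORNA, NONE_OF_ABOVE = 0, 1, 2


def pollute(clean_samples, contaminant_seqs, seed=0):
    """Return a shuffled copy of clean_samples plus the contaminants.

    clean_samples    : list of (sequence, label) pairs with label 0 or 1
    contaminant_seqs : list of tRNA/rRNA sequences, all labeled NONE_OF_ABOVE
    The input lists are left unchanged.
    """
    polluted = list(clean_samples)
    polluted.extend((seq, NONE_OF_ABOVE) for seq in contaminant_seqs)
    random.Random(seed).shuffle(polluted)  # fixed seed keeps runs repeatable
    return polluted


# Toy sequences, not real RNA records.
clean = [("AUACUUACCUG", SNRNA), ("GGCCAUGAUGA", SNORNA)]
contaminants = ["GCGGAUUUAGC", "GGUUGAUCCUG", "UACCUGGUUGA"]
dirty = pollute(clean, contaminants)

labels = [label for _, label in dirty]
print(len(dirty), labels.count(NONE_OF_ABOVE))  # 5 3
```

A classifier trained on such data has three output classes instead of two, so sequences that are neither snRNA nor snoRNA have somewhere to go.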

plot.py, plot_cv.py -- plot curves. Plot the learning curves generated during training.