============    DeepNeuralNet_QSAR Documentation     ==============

Authors: Yuting Xu, Junshui Ma. 

Contact: yuting.xu@merck.com, junshui_ma@merck.com.

Affiliation: Merck Biometrics Research, Merck Sharp & Dohme Corp. a subsidiary of Merck & Co., Inc., Kenilworth, NJ, USA.

Date: 02/07/2017

	This code was developed based on George Dahl's Kaggle code from Dec. 2012.

If you use DeepNeuralNet_QSAR in scientific work that gets published, please cite the paper below in that publication:

Xu, Yuting, Junshui Ma, Andy Liaw, Robert P. Sheridan, and Vladimir Svetnik. "Demystifying Multitask Deep Neural Networks for Quantitative Structure–Activity Relationships." Journal of Chemical Information and Modeling 57, no. 10 (2017): 2490-2504.

Basic info.

System requirements:
* Python 2.7+
* Required Python Modules: 
  - Python Modules installed by default: sys, os, argparse, itertools, gzip, time
  - General Python Modules: numpy, scipy.sparse
  - Special Python Modules: gnumpy, and either cudamat (if using a GPU) or npmat (if using a multi-core CPU)
* CUDA toolkit: a prerequisite of cudamat Python Module.
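
A quick way to see which of the modules listed above are importable is sketched below. It is a minimal check, not part of the distribution; the helper names is_importable and check_environment are hypothetical, and the script only reports whether an import succeeds, not whether a GPU is actually usable.

```python
import importlib

# Modules named in the requirements list above.
REQUIRED = ["numpy", "scipy.sparse", "gnumpy"]
# Either back end is enough: cudamat for a GPU, npmat for the CPU.
BACKENDS = ["cudamat", "npmat"]


def is_importable(name):
    """Return True if the named module can be imported."""
    try:
        importlib.import_module(name)
        return True
    except Exception:
        # A module may fail with something other than ImportError,
        # for example when one of its own dependencies is broken.
        return False


def check_environment():
    """Report missing required modules and the available back ends."""
    missing = [m for m in REQUIRED if not is_importable(m)]
    backends = [m for m in BACKENDS if is_importable(m)]
    return {"missing": missing, "backends": backends}


if __name__ == "__main__":
    report = check_environment()
    print("Missing modules: %s" % (", ".join(report["missing"]) or "none"))
    print("Available back ends: %s" % (", ".join(report["backends"]) or "none"))
```

Run it from the directory that holds the package scripts so that the bundled gnumpy.py and npmat.py are found on the import path.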

Installation of Special Python Modules:
	* gnumpy: http://www.cs.toronto.edu/~tijmen/gnumpy.html
	* npmat: http://www.cs.toronto.edu/~ilya/npmat.py
	* cudamat: https://github.com/cudamat/cudamat

  - Modules "gnumpy" and "npmat" are also provided in this distribution.
  - If you do not have a GPU card or have problems installing the cudamat module, the npmat.py module will use a multi-core CPU to simulate the GPU computing.
  - Create a directory for the DeepNeuralNet_QSAR module, and keep all the Python scripts in that directory.

* Start a command-line window (in Windows) or a terminal (in Linux), and run the Python scripts. Please refer to the details below.

Brief explanation of all Python files
All the files are listed in alphabetical order, not in order of importance.
More detailed comments on the individual functions can be found inside each Python file.

	Defines several classes of common activation functions, such as ReLU/Linear/Sigmoid, along with their derivatives or error functions (if used for the output layer).
	Used by [dnn.py]
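
As an illustration of what such classes can look like, here is a minimal scalar sketch. It is not the package's code: the class layout and the method names activation, derivative and error are assumptions, and the real implementation operates on gnumpy matrices rather than single numbers.

```python
import math


class ReLU(object):
    """Rectified linear unit: max(0, x)."""

    def activation(self, x):
        return max(0.0, x)

    def derivative(self, x):
        # Slope is 1 for positive inputs and 0 otherwise.
        return 1.0 if x > 0 else 0.0


class Sigmoid(object):
    """Logistic function, written to avoid overflow for large |x|."""

    def activation(self, x):
        if x >= 0:
            return 1.0 / (1.0 + math.exp(-x))
        e = math.exp(x)
        return e / (1.0 + e)

    def derivative(self, x):
        s = self.activation(x)
        return s * (1.0 - s)


class Linear(object):
    """Identity activation, a common choice for a regression output layer."""

    def activation(self, x):
        return x

    def derivative(self, x):
        return 1.0

    def error(self, target, output):
        # Squared error (assumed here as the output-layer loss).
        return 0.5 * (target - output) ** 2
```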

	Uses sys.stderr to display a progress bar for each training epoch.
	Includes several different progress bar classes, but only "Progress" and "DummyProgBar" are used.
	Used by [dnn.py]
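
The idea of drawing a progress bar on sys.stderr can be sketched as follows. SimpleProgress is a hypothetical class written for this illustration; it is not the "Progress" class shipped with the package and makes no claim about that class's interface.

```python
import sys


class SimpleProgress(object):
    """Redraw a one-line text progress bar on a stream (sys.stderr by default)."""

    def __init__(self, total, width=40, stream=None):
        self.total = max(1, total)
        self.width = width
        self.stream = stream if stream is not None else sys.stderr
        self.done = 0

    def render(self):
        # Number of filled cells is proportional to the completed steps.
        filled = int(self.width * self.done / float(self.total))
        bar = "#" * filled + "-" * (self.width - filled)
        return "[%s] %d/%d" % (bar, self.done, self.total)

    def tick(self):
        """Advance by one step and redraw the bar in place."""
        self.done = min(self.total, self.done + 1)
        # The carriage return moves the cursor back to the line start.
        self.stream.write("\r" + self.render())
        if self.done == self.total:
            self.stream.write("\n")
        self.stream.flush()
```

Writing to sys.stderr keeps the bar out of anything redirected from standard output, such as a log of per-epoch results.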

[DeepNeuralNetPredict.py]
	Makes predictions for new compound structures with a single-task or multi-task DNN trained by DeepNeuralNetTrain.py or DeepNeuralNetTrain_dense.py.

[DeepNeuralNetTrain.py]
	Trains a multi-task or single-task DNN with sparse QSAR dataset(s). Accepts raw csv datasets or processed npz datasets.

[DeepNeuralNetTrain_dense.py]
	Trains a multi-task DNN with dense QSAR dataset(s). Accepts raw csv datasets or processed npz datasets.

[dnn.py]
	Key components of a simple feed-forward neural network.
	Used by [DeepNeuralNetTrain.py], [DeepNeuralNetPredict.py], [DeepNeuralNetTrain_multi.py] and [DeepNeuralNetPredict_multi.py]

	A group of helper functions, such as calculating R-squared and writing predictions to a file.
	Used by many other files in the package.
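
For readers unfamiliar with the metric, one common definition of R-squared in QSAR work is the squared Pearson correlation between observed and predicted activities. The sketch below uses that definition as an assumption; the package's own helper may compute it differently, and the function name r_squared is hypothetical.

```python
def r_squared(y_true, y_pred):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(y_true)
    if n != len(y_pred) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_t = sum(y_true) / float(n)
    mean_p = sum(y_pred) / float(n)
    # Sums of cross-products and squared deviations from the means.
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        # Correlation is undefined for a constant sequence; report 0.
        return 0.0
    return (cov * cov) / (var_t * var_p)
```

With this definition the value always lies between 0 and 1, and a perfectly anti-correlated prediction also scores 1, so it should be read together with the MSE.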

[gnumpy.py]
	A simple Python module for GPU computing, the "GPU version" of the numpy module.

[npmat.py]
	A simple Python module required by gnumpy.py for the simulation mode.
	If cudamat cannot be imported, npmat (CPU computing) is used instead.

[processData_sparse.py], [processData_dense.py]
	Pre-process a group of raw csv QSAR datasets (either sparse or dense) into a sparse-matrix Python file format (saved as *.npz),
	to facilitate later use.
	Contain many data-manipulation functions used by other files in the package.

How to use - Example scripts
0) Prepare input datasets
	[sparse datasets]
	* Arrange all the datasets like the examples in the "data_sparse" folder.
	* Example #1 (a subset of three tasks from the 15 Kaggle datasets): 
		 - Folder name: data_sparse
		 - Contains several datasets, each with a training set and a test set: 
				METAB_training.csv METAB_test.csv   
				OX1_training.csv   OX1_test.csv   
				TDI_training.csv   TDI_test.csv  
	* Example #2 (a single task selected from the Kaggle datasets): 
		 - Folder name: data_sparse_single
		 - Contains one pair of training set and test set:
				METAB_training.csv METAB_test.csv   
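
Before training, it can help to confirm that every task in a folder has both files. The sketch below relies only on the naming pattern shown above (TASK_training.csv and TASK_test.csv); the function name list_task_pairs is hypothetical and not part of the package.

```python
import os


def list_task_pairs(folder):
    """Map each task name to its (training file, test file) paths.

    Only tasks for which both files exist are returned.
    """
    names = set(os.listdir(folder))
    suffix = "_training.csv"
    pairs = {}
    for name in sorted(names):
        if name.endswith(suffix):
            task = name[:-len(suffix)]
            test_name = task + "_test.csv"
            if test_name in names:
                pairs[task] = (os.path.join(folder, name),
                               os.path.join(folder, test_name))
    return pairs


if __name__ == "__main__":
    # "data_sparse" is the example folder named in this document.
    if os.path.isdir("data_sparse"):
        for task, (train, test) in sorted(list_task_pairs("data_sparse").items()):
            print("%s: %s, %s" % (task, train, test))
```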

	[dense datasets]
	* Arrange all the datasets like the examples in the "data_dense" folder.
	* Example (a subsample from the CYP datasets, which has 3 tasks): 
		 - Folder name: data_dense
		 - Contains two datasets, one training set and one test set: 
				training.csv  test.csv  

1) Pre-process data (optional, can be skipped)
	* Pre-process sparse-format datasets: create a new folder "data_sparse_processed" under the working directory to save the processed data.
		python processData_sparse.py data_sparse data_sparse_processed

	* Pre-process dense-format datasets: create a new folder "data_dense_processed" under the working directory to save the processed data. The number of tasks in the dense dataset must be given as the last argument, such as "3" for the example datasets.
		python processData_dense.py data_dense data_dense_processed 3

2) Train a single-task DNN for one QSAR task

	Defaults: log transformation of inputs, ReLU activation function, minibatch size 128, and so on.

	The key parameters that need to be specified by the user: 
	 - seed: random seed for the program. Optional, but recommended for reproducibility. 
	 - CV: (optional) proportion of the training set that is randomly sampled as the cross-validation subset
	 - test: (optional) whether to use the corresponding external test set to check test-set performance during training
	 - hid: DNN structure; specifies the number of nodes in each layer. 
	 - dropouts: the dropout probability for each layer, to prevent over-fitting. 
	 - epochs: number of epochs for training
	 - data: path to the folder that contains the data for a single QSAR task; it may contain raw csv files or processed npz files
	 - the last argument: where to save the trained model; if the folder doesn't exist, it will be created automatically
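
To make the command-line conventions concrete, here is a minimal argparse sketch that accepts the flags listed above. It is an illustration only and not the parser used by DeepNeuralNetTrain.py: the types, the helper parse_dropouts, and the positional name save_path are assumptions.

```python
import argparse


def parse_dropouts(text):
    """Turn an underscore-separated string such as '0_0.25_0.1' into floats."""
    return [float(v) for v in text.split("_")]


def build_parser():
    parser = argparse.ArgumentParser(
        description="Illustrative parser, not the package's own.")
    parser.add_argument("--seed", type=int, default=None)
    parser.add_argument("--CV", type=float, default=None)
    # A bare flag: present means True, absent means False.
    parser.add_argument("--test", action="store_true")
    # Repeating --hid adds one hidden layer per occurrence, in order.
    parser.add_argument("--hid", type=int, action="append")
    parser.add_argument("--dropouts", type=parse_dropouts, default=None)
    parser.add_argument("--epochs", type=int, default=None)
    parser.add_argument("--data", required=True)
    # The last, positional argument is the folder for the trained model.
    parser.add_argument("save_path")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Note that the examples in this document pass three dropout values together with two --hid layers, which suggests one probability for the input layer plus one per hidden layer; that reading is an assumption and should be checked against the script's own comments.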

	* Example: use raw .csv data to train a single-task DNN for METAB; the corresponding processed .npz files will be saved automatically to the input data path
		python DeepNeuralNetTrain.py --seed=0 --CV=0.4 --test --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse_single models/METAB_single

	* Example: use processed .npz data to train a single-task DNN for METAB (recommended; processed data loads faster than raw data)
	Parameters are the same as above. The processed datasets in the folder "data_sparse_single" were created by the previous example.
		python DeepNeuralNetTrain.py --seed=0 --CV=0.4 --test --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse_single models/METAB_single

	* Example: Without the optional 'CV' and 'test' arguments.
		python DeepNeuralNetTrain.py --seed=0 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse_single models/METAB_single
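
The effect of the 'CV' and 'seed' arguments can be pictured with the sketch below, which holds out a random fraction of the training rows and returns the same split whenever the same seed is given. It is a simplified stand-in and not the package's sampling code; the function name split_cv is hypothetical.

```python
import random


def split_cv(n_samples, cv_fraction, seed=None):
    """Randomly split row indices into (training, cross-validation) lists."""
    if not 0.0 <= cv_fraction < 1.0:
        raise ValueError("cv_fraction must be in [0, 1)")
    # A private generator keeps the split independent of global state.
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_cv = int(round(n_samples * cv_fraction))
    cv_idx = sorted(indices[:n_cv])
    train_idx = sorted(indices[n_cv:])
    return train_idx, cv_idx


if __name__ == "__main__":
    train_idx, cv_idx = split_cv(10, 0.4, seed=0)
    print("train rows: %s" % train_idx)
    print("CV rows:    %s" % cv_idx)
```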

3) Prediction with a single-task DNN
	The key parameters that need to be specified by the user: 
	 - model: the path to the previously trained model folder, e.g. "models/METAB_single" from step 2). 
	 - data: path to the folder that contains the data for a single QSAR task; it may contain raw csv files or processed npz files
	 - label: whether the "test" dataset has true labels. Default is 0, but in this example the test set has true labels. 
	 - rep: (optional) number of dropout prediction rounds. Default is 0, which means dropout prediction is not performed.
	 - seed: random seed for the program, useful for dropout prediction. Optional, but recommended for reproducibility. 
	 - result: (optional) where to save the prediction results. Default is the model folder.

	* Example: use the previously trained single-task DNN model for METAB to make predictions for its test data
		python DeepNeuralNetPredict.py --seed=0 --label=1 --rep=10 --data=data_sparse_single --model=models/METAB_single --result=predictions/METAB_single

	* Example: Without the optional 'seed', 'rep' and 'result' arguments:
		python DeepNeuralNetPredict.py --label=1 --data=data_sparse_single --model=models/METAB_single
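
This document does not define "dropout prediction" in detail. A common reading, assumed here, is that the network is evaluated 'rep' times with random dropout still switched on and the results are summarized, which is why the seed matters. The toy sketch below applies that idea to a single linear unit; the function dropout_predict and its arguments are hypothetical and unrelated to the package's internals.

```python
import random


def dropout_predict(weights, features, drop_prob, rep, seed=None):
    """Return (mean, standard deviation) over `rep` dropout forward passes."""
    if rep < 1:
        raise ValueError("rep must be at least 1")
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = random.Random(seed)
    # Rescale kept inputs so the expected sum matches the no-dropout sum.
    keep_scale = 1.0 / (1.0 - drop_prob)
    draws = []
    for _ in range(rep):
        total = 0.0
        for w, x in zip(weights, features):
            # Each input is dropped independently with probability drop_prob.
            if rng.random() >= drop_prob:
                total += w * x * keep_scale
        draws.append(total)
    mean = sum(draws) / float(rep)
    variance = sum((d - mean) ** 2 for d in draws) / float(rep)
    return mean, variance ** 0.5


if __name__ == "__main__":
    print(dropout_predict([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], 0.25, 10, seed=0))
```

The spread across rounds gives a rough indication of how sensitive a prediction is to the dropout noise.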

4) Train a multi-task DNN for the sparse datasets
	The processed datasets must be used, not the raw datasets.
	Parameters that are different from single-task DNN:
	 - data: path to the data folder that stores all the QSAR datasets
	 (The parameters below are optional)
	 - mbsz: the minibatch size; default is 20, but for multi-task training it may be changed to achieve better results
	 - keep: the datasets to keep in the model, if not all datasets in the 'data' folder should be included
	 - watch: when an internal cross-validation set or external test set is used, the task whose MSE and R-squared are monitored
	 - reducelearnRateVis: reducing the learning rate of the first layer sometimes helps the training process converge better

	* Example: a multi-task DNN to model all the three sparse datasets: METAB, OX1, TDI
		python DeepNeuralNetTrain.py --seed=0 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=5 --data=data_sparse models/multi_sparse_1
	* Example: load the previously trained model and continue the training process for more epochs. 
		python DeepNeuralNetTrain.py --seed=0 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse --loadModel=models/multi_sparse_1 models/multi_sparse_continue

	* Example: with more optional parameters, keep only METAB and OX1 tasks and monitor OX1 task performance
		python DeepNeuralNetTrain.py --seed=0 --CV=0.4 --test --mbsz=30 --keep=METAB --keep=OX1 --watch=OX1 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse models/multi_sparse_2

5) Prediction with multi-task DNN for the sparse datasets
	The parameter settings are the same as for a single-task DNN on a sparse dataset. See step 3).
	The only difference:
	- data: path to the data folder that stores all the processed datasets (including test datasets).

	* Example: prediction for all three sparse datasets with the model trained in the previous step, saving results to the model folder:
		python DeepNeuralNetPredict.py --label=1 --data=data_sparse --model=models/multi_sparse_1

	* Example: prediction with the model for METAB and OX1 trained in the previous step, with dropout prediction, saving results to another folder.
		python DeepNeuralNetPredict.py --label=1 --seed=0 --rep=10 --data=data_sparse --model=models/multi_sparse_2 --result=predictions/multi_sparse_2

6) Train a multi-task DNN for the dense datasets
	Most of the parameter settings are the same as for a multi-task DNN on sparse datasets.
	Difference: the 'keep' and 'watch' arguments take integer values.
	The key parameters that need to be specified by the user: 
	 - numberOfOutputs: number of QSAR task output columns in the raw training set (.csv)

	* Example: keep only the first two output tasks and monitor the first output during training, with an internal cross-validation set and an external test set, using raw data
		python DeepNeuralNetTrain_dense.py --numberOfOutputs=3 --CV=0.4 --test --keep=0_1 --watch=0 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_dense models/multi_dense_1

	* Example: Without the optional arguments, using pre-processed data
	Note: for processed data, there is no need to specify "--numberOfOutputs=3"
		python DeepNeuralNetTrain_dense.py --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_dense_processed models/multi_dense_2

7) Prediction with multi-task DNN for the dense datasets
	Parameter settings are the same as for prediction on sparse datasets.

	* Example: Prediction using the trained DNN from the previous step
		python DeepNeuralNetPredict.py --label=1 --dense --data=data_dense --model=models/multi_dense_1 --result=predictions/multi_dense_1
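
When several train and predict runs need to be scripted, the commands above can be assembled from Python. The sketch below builds argument lists using only flags that appear in this document; the helper names train_command, predict_command and run are hypothetical, and using sys.executable assumes the interpreter running the helper is the one that has gnumpy installed.

```python
import subprocess
import sys


def train_command(data, save_path, hid, dropouts, epochs, seed=None,
                  script="DeepNeuralNetTrain.py"):
    """Build the argument list for a training run."""
    cmd = [sys.executable, script]
    if seed is not None:
        cmd.append("--seed=%d" % seed)
    # One --hid flag per hidden layer, in order.
    cmd.extend("--hid=%d" % h for h in hid)
    cmd.append("--dropouts=" + "_".join(str(d) for d in dropouts))
    cmd.append("--epochs=%d" % epochs)
    cmd.append("--data=" + data)
    cmd.append(save_path)
    return cmd


def predict_command(data, model, label=1, result=None,
                    script="DeepNeuralNetPredict.py"):
    """Build the argument list for a prediction run."""
    cmd = [sys.executable, script, "--label=%d" % label,
           "--data=" + data, "--model=" + model]
    if result is not None:
        cmd.append("--result=" + result)
    return cmd


def run(cmd):
    """Run one command; raises CalledProcessError on a non-zero exit code."""
    subprocess.check_call(cmd)


if __name__ == "__main__":
    # Print the commands instead of running them; call run(cmd) to execute.
    print(" ".join(train_command("data_sparse", "models/multi_sparse_1",
                                 [2000, 1000], [0, 0.25, 0.1], 5, seed=0)))
    print(" ".join(predict_command("data_sparse", "models/multi_sparse_1")))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with folder names that contain spaces.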