Where's the Question? A Multi-channel Deep Convolutional Neural Network for Question Identification in Textual Data
(accepted paper in The 3rd Clinical Natural Language Processing Workshop)
This is the code used for the paper Where's the Question? A Multi-channel Deep Convolutional Neural Network for Question Identification in Textual Data, in which we created a multi-channel convolutional neural network that separates sentences into questions, not-questions, and c-questions: questions that refer to an issue mentioned in a nearby sentence (e.g., "can you clarify this?").
This project was created with Python 3.7 and PyTorch 0.4.1.
We provide code for the following models:
- Quest_CNN: code for Where's the Question? A Multi-channel Deep Convolutional Neural Network for Question Identification in Textual Data
- KIM_CNN: Convolutional Neural Networks for Sentence Classification
- XML_CNN: Deep Learning for Extreme Multi-label Text Classification
- Seq_cnn: Effective Use of Word Order for Text Categorization with Convolutional Neural Networks
- FastText: Bag of Tricks for Efficient Text Classification
- CHAR_CNN: Character-level Convolutional Network
- Bi_LSTM: a Bi-LSTM implementation that is the equivalent of Quest_CNN in the paper Where's the Question? A Multi-channel Deep Convolutional Neural Network for Question Identification in Textual Data
For each model, we provide an additional README in its folder with directions on how to run it.
We recommend installing and running the code from within a virtual environment.
First, download Anaconda from this link
Second, create a conda environment with python 3.7.
$ conda create -n cnn37 python=3.7
Upon restarting your terminal session, you can activate the conda environment:
$ conda activate cnn37
In the project root directory, run the following to install the required packages.
pip install -r requirements.txt
Finally, the stopwords from the NLTK library need to be downloaded:
import nltk
nltk.download('stopwords')
- Google pre-trained embeddings
To use pre-trained embeddings for the word embeddings (or the semantic embeddings), you need to download GoogleNews-vectors-negative300.bin.gz into the folder embedding_input/google_embedding.
An easy way to download it is:
wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
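Before training, it can help to confirm that the archive is where the code expects it. The following is a minimal sketch and not part of the repository: the helper name `google_embedding_path` is hypothetical, and it assumes that embedding_input/google_embedding is resolved relative to the project root.

```python
from pathlib import Path

# File name taken from the download instructions above.
EMBEDDING_FILE = "GoogleNews-vectors-negative300.bin.gz"


def google_embedding_path(project_root="."):
    """Return the expected location of the Google embeddings archive.

    Creates embedding_input/google_embedding under project_root if it
    does not exist yet, so that a later download has a destination.
    (Hypothetical helper; the folder layout follows the text above.)
    """
    folder = Path(project_root) / "embedding_input" / "google_embedding"
    folder.mkdir(parents=True, exist_ok=True)
    return folder / EMBEDDING_FILE
```

Calling `google_embedding_path().is_file()` from the project root then tells you whether the download step still has to be run.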
- MIMIC pre-trained embeddings
Unfortunately, we cannot provide the embeddings of the MIMIC III dataset, because completing a training course is mandatory to access that dataset. The code can still be executed using only the Google embeddings.
However, we provide the code for creating the MIMIC embeddings, which requires NOTEEVENTS.csv from the MIMIC III dataset.
In the preprocessing folder, we provide code and instructions for extracting potential questions and creating all the features used by the deep neural network models.
To tune the hyperparameters of a model, create a JSON file modeled on search_spaces/cnn.json and add it to search_spaces/.
Afterwards, run:
python3 param_json.py --model_name "model_name" -fn "results_file_name" -jf "search_spaces/model.json" -st search_trials
The final results are saved in dataset_output/hyperpameters/ as three files:
- results_file_name.csv : contains all the final F1 scores for each search trial
- results_file_name.json : contains the best hyper-parameters for the model
- results_file_name_param.csv : number of parameters of the model
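Once the search has finished, the best hyper-parameters can be read back for inspection. This is a minimal sketch under stated assumptions: `load_best_params` is a hypothetical helper, not a function from the repository, and it only assumes that results_file_name.json is valid JSON in the output folder named above.

```python
import json
from pathlib import Path


def load_best_params(results_file_name,
                     output_dir="dataset_output/hyperpameters"):
    """Parse <results_file_name>.json from the hyper-parameter output folder.

    The default folder is copied from the text above; the structure of
    the JSON content is not assumed, it is returned as parsed.
    """
    path = Path(output_dir) / f"{results_file_name}.json"
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `load_best_params("results_file_name")` run from the project root returns whatever the search wrote to that file.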
To run any model, first add the file that contains the sentences to dataset_input/. This file needs at least two columns (sentences, label); to use more features, it needs additional columns (such as pos-tag, medical-terms, ...).
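As an illustration of the minimal two-column layout, the sketch below writes and checks such a file. Several things here are assumptions rather than facts about the repository: that the file is a CSV with a header row, that the header names are exactly `sentences` and `label`, and the helper names `write_dataset` and `missing_columns`. Check the model READMEs for the format the code actually expects.

```python
import csv
from pathlib import Path

# Column names taken from the text above; the CSV format is an assumption.
REQUIRED_COLUMNS = ("sentences", "label")


def write_dataset(rows, path):
    """Write (sentence, label) pairs to a CSV file with a header row."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(REQUIRED_COLUMNS)
        writer.writerows(rows)
    return path


def missing_columns(path):
    """Return the required columns that are absent from the file header."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return [name for name in REQUIRED_COLUMNS if name not in header]
```

Running `missing_columns` on a file before training gives an early, readable error instead of a failure deep inside the data loader.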
Afterwards, run:
python3 main_iterations.py --model_name "model_name" -fn "results_file_name"
The final results are saved in dataset_output/results/ as two files:
- results_file_name.csv : contains all the results for each seed, the mean and standard deviation for the testing set
- results_file_name_val.csv : contains all the results for each seed, the mean and standard deviation for the validation set
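The per-seed summary in these files can be reproduced by hand. The sketch below is only an illustration: `summarize_runs` is a hypothetical helper, and it uses the sample standard deviation, whereas the repository may use the population form.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs for a standard deviation")
    return mean(scores), stdev(scores)
```

Passing the F1 scores of all seeds for one model gives the two summary numbers usually reported next to each other.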
To see all the parameters that can be changed for additional experiments, run:
python main_iterations.py -h
usage: main_iterations.py [-h] [-modn MODEL_NAME] [-fn DATA_FINAL_NAME]
[-dn DATA_NAME] [-ner NER] [-df DATA_FILE]
optional arguments:
-h, --help show this help message and exit
-modn MODEL_NAME, --model_name MODEL_NAME
name of the anmodel we are using
-fn DATA_FINAL_NAME, --data_final_name DATA_FINAL_NAME
result name.
-dn DATA_NAME, --data_name DATA_NAME
Dataset name.
-ner NER, --ner NER whether we use ner or re task
-df DATA_FILE, --data_file DATA_FILE
Path to dataset.
-dft DATA_FILE_TEST, --data_file_test DATA_FILE_TEST
Path to dataset test set.
-dd DATA_FILE_DEV, --data_file_dev DATA_FILE_DEV
Path to dataset.
-e EMBD_FILE, --embd_file EMBD_FILE
Path to Embedding File of google.
-e_mimic EMBD_FILE_MIMIC, --embd_file_mimic EMBD_FILE_MIMIC
Path to Embedding File of mimic.
-e_flag EMBEDDING_FLAG, --embedding_flag EMBEDDING_FLAG
1 use google embedding, 2 use mimic dataset, 3 random
-cn CLASS_NUMBER, --class_number CLASS_NUMBER
Number of class
Ratio of training set.
-tv TEST_VAL_RATIO, --test_val_ratio TEST_VAL_RATIO
Ratio of testing/validation set.
embedding size
-opt OPTIM, --optim OPTIM
-b BATCH_SIZE, --batch_size BATCH_SIZE
Batch Size.
-n NUM_ITERS, --num_iters NUM_ITERS
Number of iterations/epochs.
-lr LEARNING_RATE, --learning_rate LEARNING_RATE
Learning rate for optimizer.
-wd WEIGHT_DECAY, --weight_decay WEIGHT_DECAY
weight decay
-usemb USE_EMBEDDING, --use_embedding USE_EMBEDDING
if we use pre-training embedding
If we will continue the training of embedding.
-sp SAVE_PATH, --save_path SAVE_PATH
path where the model will be saved
-pr PRINTING_LOSS, --printing_loss PRINTING_LOSS
whether we print the training loss in each epoch
whether we use mutlichannel for Kim
-dr DROPOUT, --dropout DROPOUT
dropout for cnn_text
-fm FEATURE_MAPS, --feature_maps FEATURE_MAPS
size of feature map for each filter
size for each filter
-z HIDDEN_SIZE, --hidden_size HIDDEN_SIZE
Number of Units in LSTM layer.
-qm QUESTION_NAME, --question_name QUESTION_NAME
name of the column that contain questions
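The flags listed above can be combined when scripting several experiments. The following is a minimal sketch, not part of the repository: `build_command` is a hypothetical helper, and the values passed to the flags (including which strings `--model_name` accepts) are assumptions to be checked against each model's README.

```python
import sys


def build_command(model_name, results_file_name, extra_args=()):
    """Build the argument list for one run of main_iterations.py.

    Only flags shown in the help output above are used; extra_args is
    appended unchanged, e.g. ("-lr", "0.001", "-b", "32").
    """
    return [
        sys.executable,
        "main_iterations.py",
        "--model_name", model_name,
        "-fn", results_file_name,
        *extra_args,
    ]
```

The resulting list can be handed to `subprocess.run(command, check=True)` from the project root, which avoids shell quoting problems with file names.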