Where's the Question? A Multi-channel Deep Convolutional Neural Network for Question Identification in Textual Data

(accepted paper in The 3rd Clinical Natural Language Processing Workshop)

General info

This is the code used in the paper Where's the Question? A Multi-channel Deep Convolutional Neural Network for Question Identification in Textual Data. In that work we built a multi-channel convolutional neural network that classifies sentences as questions, non-questions, or c-questions, i.e. questions that refer to an issue mentioned in a nearby sentence (e.g., "can you clarify this?").

Technologies

This project was created with Python 3.7 and PyTorch 0.4.1.

Models

We provide code for the following models:

Quest_CNN: code for Where's the Question? A Multi-channel Deep Convolutional Neural Network for Question Identification in Textual Data
KIM_CNN: Convolutional Neural Networks for Sentence Classification
XML_CNN: Deep Learning for Extreme Multi-label Text Classification
Seq_cnn: Effective Use of Word Order for Text Categorization with Convolutional Neural Networks
FastText: Bag of Tricks for Efficient Text Classification
CHAR_CNN: Character-level Convolutional Network
Bi_LSTM: a Bi-LSTM counterpart of Quest_CNN, as described in the paper Where's the Question? A Multi-channel Deep Convolutional Neural Network for Question Identification in Textual Data

For each model, we provide an additional README in its folder with directions on how to run it.

Setup

We recommend installing and running the code from within a virtual environment.

Creating a Conda Virtual Environment

First, download Anaconda from this link

Second, create a conda environment with python 3.7.

$ conda create -n cnn37 python=3.7

Upon restarting your terminal session, you can activate the conda environment:

$ conda activate cnn37

Install the required python packages

In the project root directory, run the following to install the required packages.

pip install -r requirements.txt

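Finally, the stopwords from the NLTK library need to be downloaded. The call below opens the interactive NLTK downloader, where the stopwords corpus can be selected: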

python
import nltk
nltk.download()

Download pre-trained embeddings

Google pre-trained embeddings

To use pre-trained embeddings for the word embeddings (or the semantic embeddings), you need to download GoogleNews-vectors-negative300.bin.gz into the folder embedding_input/google_embedding

An easy way to download it is:

wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
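
To illustrate what the downloaded file contains, here is a minimal, stdlib-only sketch of a reader for the word2vec binary layout (an ASCII header with vocabulary size and dimension, then one word plus its float32 vector per entry). The function name read_word2vec_binary and its limit parameter are assumptions made for this example; the repository's own loading code may work differently, for instance through gensim's KeyedVectors.load_word2vec_format(path, binary=True).

```python
import gzip
import struct


def read_word2vec_binary(path, limit=None):
    """Read vectors from a word2vec binary file (optionally gzip-compressed).

    Returns a dict mapping each word to a tuple of floats.
    `limit` (hypothetical convenience parameter) caps the number of entries read.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    vectors = {}
    with opener(path, "rb") as f:
        # Header line in ASCII: "<vocab_size> <dim>\n"
        vocab_size, dim = (int(x) for x in f.readline().split())
        n = vocab_size if limit is None else min(limit, vocab_size)
        for _ in range(n):
            # Each entry starts with the word, terminated by a single space.
            chars = bytearray()
            while True:
                ch = f.read(1)
                if ch == b"":
                    raise EOFError("unexpected end of file while reading a word")
                if ch == b" ":
                    break
                if ch != b"\n":  # some writers put a newline between entries
                    chars += ch
            word = chars.decode("utf-8", errors="ignore")
            # The word is followed by `dim` little-endian float32 values.
            vectors[word] = struct.unpack("<%df" % dim, f.read(4 * dim))
    return vectors
```

With the full GoogleNews file, passing a limit (e.g. read_word2vec_binary(path, limit=50000)) avoids holding every vector in memory as Python tuples.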

MIMIC pre-trained embeddings

Unfortunately, we cannot provide the embeddings trained on the MIMIC III dataset, because completing a training course is mandatory to access that dataset. The code can still be run using only the Google embeddings.

However, we provide the code for creating the MIMIC embeddings; it requires NOTEEVENTS.csv from the MIMIC III dataset.

Extracting questions and creating features for the deep neural network models

In the preprocessing folder, we provide code and instructions for extracting potential questions and creating all the features used by the deep neural network models.

Running code

Hyperparameter tuning

To tune the hyperparameters of a model, create a JSON file modeled on search_spaces/cnn.json and add it to search_spaces/
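
As an illustration of what such a search-space file might look like and how trials could be drawn from it, here is a minimal sketch. The hyperparameter names, the list-of-candidates schema, and the sample_trials helper are all assumptions made for this example; the format actually expected by param_json.py is the one used in search_spaces/cnn.json.

```python
import json
import random

# Hypothetical search space: each key is a hyperparameter name and each value
# is the list of candidate settings. The real search_spaces/cnn.json may differ.
SEARCH_SPACE_JSON = """
{
  "learning_rate": [0.001, 0.0005, 0.0001],
  "dropout": [0.3, 0.5],
  "num_filters": [100, 200, 300],
  "kernel_sizes": [[3, 4, 5], [2, 3, 4]]
}
"""


def sample_trials(space, n_trials, seed=0):
    """Draw n_trials random configurations from a dict of candidate lists."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(candidates) for name, candidates in space.items()}
        for _ in range(n_trials)
    ]


space = json.loads(SEARCH_SPACE_JSON)
trials = sample_trials(space, n_trials=5)
for trial in trials:
    print(trial)
```

Random sampling is used here only because it is the simplest strategy to show; the search strategy implemented in the repository is not specified in this README.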

Afterwards, run:

python3 param_json.py --model_name "model_name" -fn "results_file_name" -jf "search_spaces/model.json" -st search_trials

The final results are saved in dataset_output/hyperpameters/ as three files:

results_file_name.csv : contains all the final F1 scores for each search trial
results_file_name.json : contains the best hyper-parameters for the model
results_file_name_param.csv : number of parameters of the model

Running model

To run any model, first add the file that contains the sentences to dataset_input/. This file needs at least two columns (sentences, label); to use more features, it needs additional columns (e.g. pos-tag, medical-terms, ...).
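
The following sketch shows one way such an input file could be produced and checked for the two required columns. It assumes the file is a CSV, that labels are stored as the strings shown, and it uses invented example sentences; the file name sentences.csv and the helpers write_dataset and missing_columns are hypothetical and not part of the repository.

```python
import csv
import os
import tempfile

# Column names taken from the README; everything else below is illustrative.
REQUIRED_COLUMNS = ("sentences", "label")

# Invented example rows; the label strings are an assumed encoding.
ROWS = [
    {"sentences": "Is the patient still taking aspirin?", "label": "question"},
    {"sentences": "The dose was changed last week.", "label": "not-question"},
    {"sentences": "Can you clarify this?", "label": "c-question"},
]


def write_dataset(path, rows):
    """Write rows to a CSV file with a header of the required columns."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(REQUIRED_COLUMNS))
        writer.writeheader()
        writer.writerows(rows)


def missing_columns(path):
    """Return the required columns that are absent from the file's header."""
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f))
    return [c for c in REQUIRED_COLUMNS if c not in header]


# Written to a temporary directory here; in the project the file would be
# placed in dataset_input/ instead.
path = os.path.join(tempfile.mkdtemp(), "sentences.csv")
write_dataset(path, ROWS)
print(missing_columns(path))  # [] when both required columns are present
```

Extra feature columns (such as pos-tag or medical-terms) would be added to the header and rows in the same way.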