The model is implemented in 4_final_model.py. This is the latest version of the CNN-LSTM model, which includes:
- hand-crafted features
- the attention mechanism
- class-weight balancing
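As an illustration of class-weight balancing, the sketch below computes weights that are inversely proportional to class frequency (the common "balanced" heuristic, n_samples / (n_classes * class_count)). This README does not state how 4_final_model.py derives its weights, so the formula, the function name, and the example labels are all assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumption: this is the usual 'balanced' heuristic, not necessarily
    the scheme used in 4_final_model.py.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 0 = Control, 1 = Dementia.
example_labels = [0, 0, 0, 1]
example_weights = balanced_class_weights(example_labels)
```

With this heuristic the rarer class receives the larger weight, so errors on it contribute more to the training loss.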
To run the experiments, download the Pitt Corpus transcripts from: https://dementia.talkbank.org/access/English/Pitt.html
The data is not included in our submission because it is private and requires authorization to use. To obtain authorization, follow the instructions at: https://dementia.talkbank.org/.
Once the data has been downloaded, arrange it in the folder structure below; empty folders are included in our supplement as a reference:
root:
- data:
-- Pitt_transcripts:
--- Control:
---- cookie:
---- fluency:
--- Dementia:
---- cookie:
---- fluency:
---- recall:
---- sentence:
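The folder tree above can be checked, or recreated, with a short script. This is a minimal sketch that assumes it is run from the root folder; the helper names missing_dirs and create_dirs are hypothetical and are not part of the supplement.

```python
from pathlib import Path

# Sub-folders listed in the tree above, relative to the root folder.
EXPECTED_DIRS = [
    "data/Pitt_transcripts/Control/cookie",
    "data/Pitt_transcripts/Control/fluency",
    "data/Pitt_transcripts/Dementia/cookie",
    "data/Pitt_transcripts/Dementia/fluency",
    "data/Pitt_transcripts/Dementia/recall",
    "data/Pitt_transcripts/Dementia/sentence",
]


def missing_dirs(root="."):
    """Return the expected sub-folders that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


def create_dirs(root="."):
    """Create any expected sub-folder that is missing (existing ones are kept)."""
    root = Path(root)
    for d in EXPECTED_DIRS:
        (root / d).mkdir(parents=True, exist_ok=True)
```

Calling missing_dirs() before the preprocessing step gives a quick indication of whether the transcripts were unpacked in the expected place.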
Then run the following steps in order:
- Run 0_pitt_transcript_preprocessing_and_pickle.py. This preprocesses the interviews and creates a .pickle file.
- Run 1_pitt_anagraphic_information.py. This script produces a .pickle file containing the patients' demographic information, starting from the file anagraphic_modded.csv (this part of the dataset is freely available).
- Run 2_psycolinguistic_features_computation_and_merge.py. This script merges the files produced above and computes the other linguistic features mentioned in the paper. It produces "pitt_full_interview_features.pickle", which is required to run the model.
- Download the 300d GloVe embeddings from: http://nlp.stanford.edu/data/glove.6B.zip and place them in the glove.6B folder.
- Run 4_final_model.py. This script trains the model and tests it on three different data shuffles. It produces a list of three dictionaries containing the main classification metrics obtained on each split.
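The steps above can be chained in a small driver script. The sketch below is not part of the supplement: it assumes that the scripts take no command-line arguments, that they are run from the root folder, and that the 300d vectors sit at glove.6B/glove.6B.300d.txt (the name of the 300d file inside glove.6B.zip). The helper names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Script names are taken from the steps above; running them with no
# arguments from the root folder is an assumption.
SCRIPTS = [
    "0_pitt_transcript_preprocessing_and_pickle.py",
    "1_pitt_anagraphic_information.py",
    "2_psycolinguistic_features_computation_and_merge.py",
    "4_final_model.py",
]

# Assumed location of the 300d vectors inside the glove.6B folder.
GLOVE_FILE = Path("glove.6B") / "glove.6B.300d.txt"


def missing_prerequisites(root="."):
    """Return the scripts and embedding file that are absent under root."""
    root = Path(root)
    expected = [Path(s) for s in SCRIPTS] + [GLOVE_FILE]
    return [p.as_posix() for p in expected if not (root / p).exists()]


def run_pipeline(root="."):
    """Run each script in order, stopping at the first one that fails."""
    missing = missing_prerequisites(root)
    if missing:
        raise FileNotFoundError("Missing: " + ", ".join(missing))
    for script in SCRIPTS:
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run([sys.executable, script], cwd=root, check=True)
```

Calling run_pipeline() from the root folder runs the four scripts in sequence; checking for the embedding file up front avoids discovering that it is missing only after the preprocessing steps have finished.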