This repository contains the code that implements the research described in Shuxin Zhang's master's thesis. He carried out this work for his Master Medical Informatics at the University of Amsterdam. The project was supervised by Sílvia Delgado Olabarriaga (tutor) and Allard van Altena (mentor) from the Department of Clinical Epidemiology, Biostatistics, and Bioinformatics, Amsterdam University Medical Center. Special thanks go to the CLEF eHealth Lab for providing the necessary data.
The Python code is organized into four major parts (one folder each), which correspond to four chapters of the thesis:
Chapter 3 Data Preparation
Chapter 4 Baseline Model Design
Chapter 5 Active Learning Model Design
Chapter 6 Evaluation of Active Learning Model
run_baseline.py, run_similarity.py, and result_analysis.py are updated scripts that were added for further investigation after the thesis project.
The whole project is implemented in Python. The required packages are listed in requirements.txt.
With Python and pip installed you can run the following command to install the dependencies:
$ pip install -r requirements.txt
To reproduce the results of all experiments, run the following Python scripts in order:
This creates: a SQL database in the folder 'Database'; fifty pickle files (forty full datasets and ten partial datasets) in the folder 'Datasets'; and one 'word frequency' graph in the folder 'Figure'.
Loops over the qrel files (train and test), fetches the PubMed articles, and stores them in a pickle file.
Reads the pickle files created by fetch_raw.py and moves their contents to the database.
Checks the database against the qrel files to verify that everything was successfully fetched and stored.
Runs preprocessing on the documents and stores the result as a feature matrix.
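The first three data preparation steps described above can be sketched as follows. This is a minimal illustration, not the code in this repository: it assumes that the qrel files use the four-column TREC layout (topic, iteration, PMID, relevance label) and that each pickle file holds a dictionary mapping PMIDs to abstract text. The function names and the 'articles' table are hypothetical.

```python
import pickle
import sqlite3
from pathlib import Path


def read_qrels(path):
    """Parse a qrel file into {topic_id: {pmid: label}}.

    Assumes four whitespace-separated columns per line:
    topic, iteration, PMID, relevance label.
    """
    qrels = {}
    for line in Path(path).read_text().splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _, pmid, label = parts
        qrels.setdefault(topic, {})[pmid] = int(label)
    return qrels


def load_pickle(path):
    """Load a pickled {pmid: abstract} dictionary (assumed layout)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def store_articles(conn, articles):
    """Insert {pmid: abstract} pairs into a hypothetical 'articles' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(pmid TEXT PRIMARY KEY, abstract TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO articles (pmid, abstract) VALUES (?, ?)",
        articles.items(),
    )
    conn.commit()


def missing_pmids(conn, qrels):
    """Return PMIDs that appear in the qrels but not in the database."""
    stored = {row[0] for row in conn.execute("SELECT pmid FROM articles")}
    wanted = {pmid for docs in qrels.values() for pmid in docs}
    return sorted(wanted - stored)
```

An empty list from missing_pmids would indicate that every article referenced in the qrel files was fetched and stored; any PMIDs it returns would need to be fetched again.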
This part corresponds to all experiments described in Chapter 4 Baseline Model Design of the thesis.
All results are stored as pickle files in the folder "Consolidation", whereas graphs are stored in the 'Figure' folder.
This part corresponds to all experiments described in Chapter 5 Active Learning Model Design of the thesis.
All results are stored as pickle files in the folder "Consolidation", whereas graphs are stored in the 'Figure' folder.
This part corresponds to all experiments described in Chapter 6 Evaluation of Active Learning Model of the thesis.
All results are stored as pickle files in the folder "Consolidation", whereas graphs are stored in the 'Figure' folder.
config.py defines the storage paths of results and graphs for most experiments;
recipe.py specifies which value is selected for each classifier parameter. Edit this file to reflect the results of your own experiments.
Shuxin Zhang, Medical Informatics Master student, University of Amsterdam.
Allard van Altena, PhD candidate, University of Amsterdam.
Sílvia Delgado Olabarriaga (tutor), assistant professor, University of Amsterdam.