This repository contains the code used to run the experiments of the Master's thesis *Text-based Prediction of Popular Click Paths in Wikipedia*.
# Setup virtual Python 3.7 environment (with conda)
conda create -n smash-rnn python=3.7
conda activate smash-rnn
# Install dependencies
pip install -r requirements.txt
This project uses a combination of English-language Wikipedia articles and English Wikipedia clickstream data.
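For reference, the public clickstream dumps are tab-separated files with four columns (`prev`, `curr`, `type`, `n`): the source page, the target page, the type of transition, and the number of clicks. A minimal sketch for inspecting one with pandas follows; the file name is only illustrative:

```python
import pandas as pd

# Illustrative file name; any monthly English clickstream dump has the same layout.
clickstream = pd.read_csv(
    "clickstream-enwiki-2020-01.tsv.gz",
    sep="\t",
    names=["prev", "curr", "type", "n"],
)

# Most-clicked article-to-article transitions (type "link" means an internal link).
print(clickstream[clickstream["type"] == "link"].nlargest(10, "n"))
```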
- Download the datasets used in this experiment here.
- Unzip the downloaded file in the root folder of the project.
To create the datasets from scratch, run the steps below. Please note that:
- This process may take up to 8 hours to complete when running on a server (data extraction and tokenization are the most time-consuming tasks).
- The availability of Wikipedia dumps is limited: in general, only the last 3-5 dumps are available.
- This process was tested only with the English and Simple English Wikipedia dumps.
- Download the Clickstream data: https://dumps.wikimedia.org/other/clickstream/
- Download the Wikipedia dump: https://ftp.acc.umu.se/mirror/wikimedia.org/dumps/enwiki/
- Move the Clickstream and Wikipedia dump files into the folder `./data/source`
- In the root folder, run the script `dataset_creator.py`:

  python dataset_creator.py
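Because the creation script runs for several hours, a quick check that the Clickstream and dump files actually ended up in `./data/source` can save a failed run. A minimal sketch that makes no assumption about the exact file names:

```python
from pathlib import Path

# List whatever is currently in the source folder read by dataset_creator.py.
source_dir = Path("data/source")
files = sorted(source_dir.glob("*"))

print(f"Found {len(files)} file(s) in {source_dir}:")
for f in files:
    print(f"  {f.name}  ({f.stat().st_size / 1e6:.1f} MB)")
```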
You can find trained models in this link. Unzip the file in the project root folder.
- Smash RNN: To train the model with default parameters, go to the root of the application and run:
python train_smash_rnn.py
Options (an example invocation follows this list):
- --num_epochs: The number of epochs to train the model (default: 1)
- --batch_size: The number of articles in each batch. This value needs to be small when using the complete article structure due to memory limitation issues (default: 6)
- --level: The deepest level of Smash RNN. Possible choices are `word`, `sentence`, or `paragraph` (default: `paragraph`)
- --paragraphs_limit: Maximum number of paragraphs per article that will be processed (default: 300)
- --model_name: Name to use when saving the model and results (default: `base`)
- --w2v_dimension: Number of dimensions of Word2Vec (default: 50)
- --introduction_only: Whether the model should use only the introduction section or the complete text of the article (default: `False`)
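For example, the command below trains a sentence-level model for five epochs under a non-default name; the flag values are illustrative, not recommendations:

python train_smash_rnn.py --num_epochs=5 --batch_size=4 --level=sentence --model_name=sentence_model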
To test the model with default parameters, run:
python test_smash_rnn.py --model_name=NAME_OF_THE_MODEL_TRAINED --level=word
Options (an example invocation follows this list):
- --batch_size: The number of articles in each batch (default: 6)
- --level: The deepest level of Smash RNN. Possible choices are `word`, `sentence`, or `paragraph` (default: `paragraph`)
- --paragraphs_limit: Maximum number of paragraphs per article that will be processed (default: 300)
- --model_name: Name to use when saving the model and results. Should match the name of a trained model (default: `base`)
- --w2v_dimension: Number of dimensions of Word2Vec (default: 50)
- --introduction_only: Whether the model should use only the introduction section or the complete text of the article (default: `False`)
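For example, to evaluate the sentence-level model trained in the example above (the model name is illustrative and must match a model you actually trained):

python test_smash_rnn.py --model_name=sentence_model --level=sentence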
Additional models were developed to compare the results with Smash RNN:
- Wikipedia2Vec:
  - First, you need to learn the embeddings for the Wikipedia2Vec model. See the instructions at https://wikipedia2vec.github.io/wikipedia2vec/commands/ and the example command after this list.
  - Train: python train_wikipedia2vec.py
  - Test: python test_wikipedia2vec.py
- Doc2Vec:
  - Train: python train_doc2vec.py
  - Test: python test_doc2vec.py
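For the Wikipedia2Vec baseline, the embeddings can be learned with the `wikipedia2vec` command-line tool. A minimal sketch, with the dump file name and output path chosen only for illustration (see the linked documentation for the available options):

wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 wikipedia2vec_enwiki.pkl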
To generate the tables and figures from the Results section of the thesis, follow the steps presented in the Jupyter notebook `Results analyzer.ipynb`.