- Thomas Ørkild - s154433@student.dtu.dk
- Christian Ingwersen - s154264@student.dtu.dk
The final project report can be downloaded from Dropbox here.
In this project we have investigated different algorithms for finding the K nearest Wikipedia articles to a given text. The idea behind our tool is that Wikipedia contributors can write a new text and find all the similar articles already on Wikipedia. The contributors can then link to these articles from the new text to help readers find related knowledge.
The tool was implemented as an exam project in the DTU course 02807 Computational Tools for Data Science.
- Download the Wikipedia database from here.
- Extract and clean the downloaded Wikipedia dump file, wiki-dump.bz2, using WikiExtractor:

      python wikiextractor/WikiExtractor.py wiki-dump.bz2
- The model assumes the cleaned wiki files to be in ~/Documents/text, so move all the files there:

      mv text/ ~/Documents/
- Train the gensim doc2vec model (a sketch of the training code appears after this list):

      python train_doc2vec.py
- Start the web server. The code assumes that the trained model is located at ~/Documents/my_doc2vec_model_trained (see the endpoint sketch after this list):

      python WebService.py
- Use POST requests to query the web server. The first time this is run, it will generate the annoy index with 50 trees, which takes about 1.5 hours (an index-building sketch appears after this list):

      python most_similar.py -u http://localhost:5000/api/mostsimilar/ --method annoy <<< "This is a test string"
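For reference, here is a minimal sketch of what doc2vec training along the lines of train_doc2vec.py could look like. It assumes the WikiExtractor output lives under ~/Documents/text; the corpus reader, the per-line document split, and all hyperparameters are illustrative assumptions, not the project's actual settings.

```python
# Hedged sketch of doc2vec training with gensim; how articles are split out
# of the WikiExtractor output and all hyperparameters are assumptions.
import os
from pathlib import Path

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

def read_corpus(root):
    # WikiExtractor writes files named wiki_00, wiki_01, ... into
    # subdirectories; here every non-empty line is treated as one document.
    tag = 0
    for path in Path(root).rglob("wiki_*"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield TaggedDocument(simple_preprocess(line), [tag])
                    tag += 1

corpus = list(read_corpus(os.path.expanduser("~/Documents/text")))

model = Doc2Vec(vector_size=100, min_count=5, epochs=10, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
model.save(os.path.expanduser("~/Documents/my_doc2vec_model_trained"))
```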
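The web service could expose an endpoint shaped roughly like the sketch below, assuming Flask. The route matches the URL that most_similar.py queries, but the request body and JSON response format are assumptions about the real API, and WebService.py can answer via the annoy index rather than gensim's exact search.

```python
# Hedged sketch of a Flask endpoint matching the URL most_similar.py
# queries; the request/response formats are assumptions about the real API.
import os

from flask import Flask, jsonify, request
from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess

app = Flask(__name__)
model = Doc2Vec.load(os.path.expanduser("~/Documents/my_doc2vec_model_trained"))

@app.route("/api/mostsimilar/", methods=["POST"])
def most_similar():
    # Infer a vector for the posted text and return the nearest documents
    # (exact search here; the real service can use the annoy index instead).
    tokens = simple_preprocess(request.get_data(as_text=True))
    vector = model.infer_vector(tokens)
    hits = model.dv.most_similar([vector], topn=10)  # model.docvecs in gensim 3
    return jsonify([{"tag": str(tag), "similarity": float(sim)} for tag, sim in hits])

if __name__ == "__main__":
    app.run(port=5000)
```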
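Building the annoy index over the trained document vectors amounts to something like the following sketch; the 50 trees match the number above, while the angular metric and the index filename are assumptions. More trees make the index slower to build but its neighbour queries more accurate, which is why the first request takes so long.

```python
# Hedged sketch: build a 50-tree Annoy index over the trained doc2vec
# vectors. The angular metric and the output filename are assumptions.
import os

from annoy import AnnoyIndex
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load(os.path.expanduser("~/Documents/my_doc2vec_model_trained"))
vectors = model.dv.vectors  # model.docvecs.vectors_docs in older gensim

index = AnnoyIndex(model.vector_size, "angular")  # cosine-like distance
for i, vec in enumerate(vectors):
    index.add_item(i, vec)

index.build(50)  # more trees: slower build, more accurate neighbours
index.save(os.path.expanduser("~/Documents/wiki_annoy.ann"))
```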
To run inference locally:
python most_similar.py -m /path/to/trained_doc2vec_Model -n 10 < testfile.txt
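Under the hood, the local path roughly amounts to the sketch below; most_similar.py's actual argument handling and output format may differ, and the model path is the same placeholder as in the command above.

```python
# Hedged sketch of local inference: read query text from stdin, infer a
# doc2vec vector, and print the 10 nearest documents.
import sys

from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess

model = Doc2Vec.load("/path/to/trained_doc2vec_Model")  # placeholder path
tokens = simple_preprocess(sys.stdin.read())
vector = model.infer_vector(tokens)

for tag, similarity in model.dv.most_similar([vector], topn=10):
    print(f"{tag}\t{similarity:.3f}")
```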
To run inference through the web service, first start the server:
python WebService.py
and then to run inference on the server:
python most_similar.py -u http://localhost:5000/api/mostsimilar/ < testfile.txt
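Equivalently, the running service can be queried from Python with the requests library. This is a hedged example: the plain-text request body and JSON response mirror the endpoint sketch above, not a documented API.

```python
# Hedged sketch of querying the running web service directly; assumes the
# endpoint accepts a plain-text body and returns JSON.
import requests

with open("testfile.txt", encoding="utf-8") as f:
    query = f.read()

resp = requests.post(
    "http://localhost:5000/api/mostsimilar/",
    data=query.encode("utf-8"),
    timeout=600,  # generous: the first request may build the annoy index
)
resp.raise_for_status()
print(resp.json())
```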