Ntds_project_team02: A Jupyter Notebook repository from FredBaos

Wikipedia Recommender System

Welcome to our project repository for the Network Tour of Data Science course at EPFL!

We implemented a query-based search engine for Wikipedia articles related to various Machine Learning topics.

In other words, given a query, our system retrieves and suggests articles with semantically similar content. We also provide a graph visualisation tool for interacting with the query engine.

More details about this ML system can be found in the project [report](Team 02 - Project report.pdf).

How to reproduce results:

In the commands below, 'wd' stands for the path of the directory containing the run.sh script (in the project folder).

Run the command export PYTHONPATH=wd

NOTE: if you want to use a virtual environment, run the following:

python3 -m venv ntds
echo 'export PYTHONPATH=wd' >> ntds/bin/activate

From wd, run the following:

Run the command sudo apt install build-essential python-dev libxml2 libxml2-dev zlib1g-dev bison flex
pip3 install -r requirements.txt
pip3 install pymagnitude==0.1.120 --no-binary :all:
Specify INITIAL_FILENAME in config.py. This is the name of the file produced by Seealsology, which must be placed in the data folder. The seeds used to scrape the graph are listed in the seeds_seealsology.txt file (we used a distance of 2).
Download the wiki-news-300d-1M-subword.magnitude file and put it into the data folder.
Execute the run.sh script (takes a few minutes to run).
Run exploration.ipynb and/or exploitation.ipynb for the respective analysis.
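
Before executing run.sh, it can help to check that the files mentioned in the steps above are in place. The following is a minimal sketch, not part of the repository: the function name missing_prerequisites is hypothetical, and it assumes that the data folder and requirements.txt sit directly inside wd.

```python
from pathlib import Path

# File name taken from the steps above.
EMBEDDINGS_FILENAME = "wiki-news-300d-1M-subword.magnitude"


def missing_prerequisites(wd, initial_filename):
    """Return the required paths (relative to wd) that do not exist yet.

    Assumption: the data folder and requirements.txt live directly in wd.
    """
    wd = Path(wd)
    required = [
        wd / "run.sh",
        wd / "requirements.txt",
        wd / "data" / initial_filename,     # Seealsology export (INITIAL_FILENAME)
        wd / "data" / EMBEDDINGS_FILENAME,  # pre-trained word vectors
    ]
    return [p.relative_to(wd).as_posix() for p in required if not p.exists()]
```

Calling missing_prerequisites(".", name) from wd, where name is the value you set for INITIAL_FILENAME, returns an empty list once everything is in place.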

Interactive Visualisation:

After having done the previous part, run the command: python3 visualization/app.py 8888

NOTE: if you want to put the app online, you have to perform all the installs above in "sudo" mode and run the following command instead: sudo PYTHONPATH=wd python3 visualization/app.py 80. Alternatively, you can allow the current user to bind to port 80. If you use a server, port 80 must be open for external access.

You can choose any of the three methods to perform a query.

For multiple concepts, separate them with commas, e.g. machine learning,text processing
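
As an illustration of the comma-separated format, a query string could be split into individual concepts as sketched below. This is only an assumption about how such input might be handled; parse_query is a hypothetical helper and not a function from the repository.

```python
def parse_query(raw_query):
    """Split a comma-separated query into a list of cleaned concepts."""
    concepts = []
    for part in raw_query.split(","):
        # Collapse repeated whitespace and trim both ends.
        cleaned = " ".join(part.split())
        if cleaned:  # ignore empty entries such as "a,,b"
            concepts.append(cleaned)
    return concepts


print(parse_query("machine learning,text processing"))
```

Stray spaces around the commas and empty entries are dropped, so "machine learning , text processing" yields the same two concepts.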

After you click on a node, the 'Chosen node' link redirects you to the corresponding web page.
Only the page titles of the nodes that best fit the query, and of their neighbours, are shown.
Red edges indicate that the pages are linked through the 'See also' section on the Wikipedia website.
The color of the nodes represents the cosine similarity score.
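
To make the display rules above concrete, here is a rough sketch of how nodes could be scored and selected. It is not the code used in the app: the function names, the toy vectors, and the top_k parameter are all hypothetical, and it assumes that the query and every page are already represented as embedding vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_nodes(query_vec, node_vecs, edges, top_k=1):
    """Score every page, keep the top_k best matches, and add their neighbours."""
    scores = {title: cosine_similarity(query_vec, vec)
              for title, vec in node_vecs.items()}
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    shown = set(best)
    for a, b in edges:  # edges are treated as undirected here
        if a in best:
            shown.add(b)
        if b in best:
            shown.add(a)
    return scores, best, shown


# Toy data: 2-dimensional vectors stand in for real embeddings.
node_vecs = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
edges = [("A", "B")]
scores, best, shown = select_nodes([1.0, 0.0], node_vecs, edges)
```

In this sketch the scores dictionary would drive the node colours, while shown is the set of titles to display.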

This web app has only been tested on Chrome for Linux (78.0.3904.70).

Files breakdown:

run.sh : shell script executing the acquisition, exploitation and visualisation tasks.

Acquisition:

acquisition_helpers.py : various helpers for the acquisition.py script
acquisition.py : loads the dataset and augments it with URLs and extracted keywords. Creates the df_node dataframe, which contains node information, and the df_edge dataframe, which contains edge relations.

Exploration:

exploration.ipynb: exploratory data analysis.

Exploitation:

exploitation.py: fits and saves the 3 models we used
exploitation.ipynb: loads the models and performs a qualitative evaluation on a set of queries and topics

Visualization:

app.py: runs the visualisation app on a dedicated server
create_visu.py: creates and saves the graph visualisation
utils.py: various helpers

Helpers:

predict.py: helpers for the exploitation part
spectral_clustering.py: specific helpers for the spectral clustering model

Data:

Data: contains every file loaded and generated by the different modules.

Authors

EL Amrani Ayyoub
Micheli Vincent
Myotte Frédéric
Sinnathamby Karthigan

License

Wikipedia Recommender System - Network Tour of Data Science EE-558 - EPFL - Fall 2019 - Team 2

This program is licensed under the terms of the GPL.