/Ntds_project_team02

Primary language: Jupyter Notebook. License: GNU General Public License v3.0 (GPL-3.0).

Wikipedia Recommender System

Welcome to our project repository for the Network Tour of Data Science course at EPFL!

We implemented a query-based search engine for Wikipedia articles related to various Machine Learning topics.

In other words, given a query, our system retrieves and suggests articles with semantically similar content. We also provide a graph visualisation tool to interact with the query engine.
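As a rough illustration of the retrieval idea, the sketch below ranks articles by cosine similarity between a query vector and one vector per article. The toy 3-dimensional vectors and the `cosine_similarity` and `rank_articles` helpers are assumptions made for this example; they are not the project's code, and the actual pipeline is described in the report.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention used here: a zero vector matches nothing
    return dot / (norm_u * norm_v)


def rank_articles(query_vec, article_vecs, top_k=3):
    """Return the top_k (title, score) pairs, best match first."""
    scored = [
        (title, cosine_similarity(query_vec, vec))
        for title, vec in article_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (hypothetical values, not taken from the project data).
articles = {
    "Support-vector machine": [0.9, 0.1, 0.0],
    "Natural language processing": [0.1, 0.9, 0.2],
    "Graph theory": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.0]

results = rank_articles(query, articles, top_k=2)
for title, score in results:
    print(f"{title}: {score:.3f}")
```

In practice the vectors would come from a pretrained embedding model rather than being written by hand; only the ranking step is shown here.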

More details about this ML system can be found in the project [report](Team 02 - Project report.pdf).

How to reproduce results:

Note that 'wd' is the directory containing the run.sh script (in the project folder).

  • Run the command export PYTHONPATH=wd

NOTE: if you want to use a virtual environment, run the following:

  • python3 -m venv ntds
  • echo 'export PYTHONPATH=wd' >> ntds/bin/activate

From wd, run the following:

  • Run the command sudo apt install build-essential python-dev libxml2 libxml2-dev zlib1g-dev bison flex
  • pip3 install -r requirements.txt
  • pip3 install pymagnitude==0.1.120 --no-binary :all:
  • Specify INITIAL_FILENAME in config.py. This is the name of the file produced by Seealsology, which must be placed in the data folder. The seeds used to scrape the graph are listed in the seeds_seealsology.txt file (we used a distance of 2).
  • Download the wiki-news-300d-1M-subword.magnitude file and put it into the data folder.
  • Execute the run.sh script (takes a few minutes to run).
  • Run exploration.ipynb and/or exploitation.ipynb for the respective analysis.
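Before executing run.sh, it can help to confirm that the two input files mentioned above are in place. The following is a minimal sketch of such a check; the `missing_inputs` helper is hypothetical (it is not part of the repository) and it assumes that both files live directly in the data folder under wd.

```python
from pathlib import Path

# File name taken from the download step above.
EMBEDDINGS_FILENAME = "wiki-news-300d-1M-subword.magnitude"


def missing_inputs(wd, initial_filename):
    """Return the expected input files that are absent from wd/data.

    initial_filename should be the value of INITIAL_FILENAME set in config.py.
    """
    data_dir = Path(wd) / "data"
    expected = [data_dir / initial_filename, data_dir / EMBEDDINGS_FILENAME]
    return [path for path in expected if not path.is_file()]
```

For example, calling `missing_inputs(".", name)` from wd, where `name` is the file name you set as INITIAL_FILENAME, returns an empty list when everything is ready.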

Interactive Visualisation:

After completing the previous steps, run the command: python3 visualization/app.py 8888

NOTE: if you want to put the app online like on the following link, you have to perform all the above installs with sudo and run the following command instead: sudo PYTHONPATH=wd python3 visualization/app.py 80. Alternatively, you can enable port 80 for the current user.

You can choose any of the three methods to perform a query.

For multiple concepts, separate them with a comma, e.g. machine learning,text processing. Port 80 must be open for external access if you use a server.
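The comma-separated query format can be handled along the following lines. This is only a sketch: the `parse_query` function is hypothetical, and lowercasing, whitespace trimming and duplicate removal are assumptions of this example rather than documented behaviour of the app.

```python
def parse_query(raw_query):
    """Split a comma-separated query into a list of cleaned concepts."""
    concepts = []
    for part in raw_query.split(","):
        # Trim the ends, collapse inner whitespace, and normalise case.
        concept = " ".join(part.split()).lower()
        # Skip empty entries and concepts that were already seen.
        if concept and concept not in concepts:
            concepts.append(concept)
    return concepts


print(parse_query("machine learning,text processing"))
```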

  • When you click on a node, the 'Chosen node' link redirects you to the corresponding web page.
  • Only the page titles of the nodes that best fit the query, and of their neighbours, are shown.
  • Red edges mean that the pages are linked in the 'See also' section on the Wikipedia website.
  • The color of the nodes represents the cosine similarity score.
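One simple way to turn a similarity score into a node color is a linear color ramp. The sketch below is an assumption for illustration only: the `score_to_hex` function and the white-to-blue scale are not taken from create_visu.py, and the color scale actually used by the app may differ.

```python
def score_to_hex(score):
    """Map a cosine similarity to a hex color, from white (0) to blue (1)."""
    # Clamp to [0, 1]; negative similarities are treated as 0 in this sketch.
    clamped = max(0.0, min(1.0, score))
    # Red and green channels fade out as the score rises; blue stays at full.
    fade = round(255 * (1.0 - clamped))
    return f"#{fade:02x}{fade:02x}ff"
```

A blue ramp is used here so that node colors stay distinguishable from the red 'See also' edges.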

This web app has only been tested on Chrome for Linux (78.0.3904.70).

Files breakdown:

run.sh : shell script executing the acquisition, exploitation and visualisation tasks.

Acquisition:

  • acquisition_helpers.py : various helpers for the acquisition.py script.
  • acquisition.py : loads the dataset and augments it with URLs and extracted keywords. Creates the df_node dataframe, which contains node information, and the df_edge dataframe, which contains edge relations.

Exploration:

Exploitation:

Visualization:

  • app.py: runs the visualisation app on a dedicated server
  • create_visu.py: creates and saves the graph visualisation
  • utils.py: various helpers

Helpers:

Data:

  • Data: contains every file loaded and generated by the different modules.

Authors

  • EL Amrani Ayyoub
  • Micheli Vincent
  • Myotte Frédéric
  • Sinnathamby Karthigan

License

Wikipedia Recommender System - Network Tour of Data Science EE-558 - EPFL - Fall 2019 - Team 2

Copyright (c) 2019 EPFL

This program is licensed under the terms of the GPL.