Read our report.
Python 3 is required. Install the dependencies and NLTK data:

```
pip3 install -r requirements.txt
python -m nltk.downloader popular
```
To crawl websites, you should modify two configuration files:

- `config/crawler.json` stores crawling settings, such as `user_agent`, `max_depth`, etc.
- `config/hosts.json` contains the websites to crawl.
Example of `crawler.json`:

```json
{
  "user_agent": "crawler-bot",
  "max_pages_count": 100000,
  "max_depth": 10,
  "delay": 50,
  "frontier_dump_delay_s": 600
}
```
Example of `hosts.json`:

```json
{
  "hosts": [
    "https://arxiv.org/",
    "https://www.springer.com",
    "https://link.springer.com"
  ]
}
```
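For illustration, here is a minimal sketch of how these two files might be read; the `load_configs` helper is hypothetical and not part of the repository:

```python
import json

def load_configs():
    # Hypothetical helper: read the two config files described above.
    with open("config/crawler.json") as f:
        crawler_cfg = json.load(f)
    with open("config/hosts.json") as f:
        hosts = json.load(f)["hosts"]
    return crawler_cfg, hosts
```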
To start crawling, run `python3 main.py crawler`. All downloaded pages are stored in the `webdata/` folder.
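Conceptually, the crawler is a frontier-based traversal bounded by the settings from `config/crawler.json`. The sketch below illustrates that idea and is not the project's actual implementation; it assumes the `requests` and `beautifulsoup4` packages, and that `delay` is in milliseconds:

```python
import time
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(hosts, user_agent, max_depth, max_pages_count, delay):
    """Breadth-first crawl bounded by depth and total page count (illustrative)."""
    frontier = deque((url, 0) for url in hosts)
    seen = set(hosts)
    pages = []
    while frontier and len(pages) < max_pages_count:
        url, depth = frontier.popleft()
        resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
        pages.append((url, resp.text))
        if depth < max_depth:
            # Enqueue unseen links one level deeper.
            for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"])
                if link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))
        time.sleep(delay / 1000)  # "delay" assumed to be milliseconds
    return pages
```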
After crawling, you need to parse the downloaded data by running `python3 main.py parser`.
Currently, we support parsers for arxiv.org and springer.com.
For each article, we store its attributes: title, abstract, processed abstract, number of words, the link to the PDF, and so on.
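For example, a single parsed article record might look roughly like this (field names are illustrative, not the parser's actual schema):

```python
# Illustrative article record; the real field names may differ.
article = {
    "title": "Example article title",
    "abstract": "The original abstract text ...",
    "processed_abstract": "abstract after tokenization and normalization",
    "words_count": 4200,
    "pdf_link": "https://arxiv.org/pdf/XXXX.XXXXX",
}
```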
We build an inverted index: for each word, we store the list of documents that contain it, and for each document we store the number of occurrences of that word together with the positions of those occurrences. Each word's document list is sorted by occurrence count.
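A minimal sketch of this structure, assuming each document is already a list of tokens (the names below are illustrative):

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to a list of (doc_id, count, positions), sorted by count."""
    postings = defaultdict(dict)
    for doc_id, tokens in enumerate(documents):
        for pos, word in enumerate(tokens):
            entry = postings[word].setdefault(doc_id, [0, []])
            entry[0] += 1            # occurrence count
            entry[1].append(pos)     # positions of occurrences
    # Sort each word's document list by occurrence count, as described above.
    return {
        word: sorted(
            ((doc_id, count, positions) for doc_id, (count, positions) in docs.items()),
            key=lambda entry: entry[1],
            reverse=True,
        )
        for word, docs in postings.items()
    }

# Example: each posting is (doc_id, occurrences, positions).
index = build_inverted_index([["search", "engine", "search"], ["web", "search"]])
# index["search"] == [(0, 2, [0, 2]), (1, 1, [1])]
```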
To start indexing, run `python3 main.py index`.
We associate each article with a point on a 2D plane. The desired property is that nearby points correspond to “similar” articles, i.e., articles with similar meaning, a common theme, etc.
To measure the “similarity” of articles, we use gensim's `doc2vec` model: each article is mapped to a multidimensional vector trained by this model. We then reduce the dimensionality of those vectors to 2 using t-SNE.
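A minimal sketch of that pipeline with gensim (4.x) and scikit-learn; the hyperparameter values are placeholders, not the project's actual settings:

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.manifold import TSNE

# Assume `tokenized_abstracts` is a list of token lists, one per article.
corpus = [TaggedDocument(words=tokens, tags=[i])
          for i, tokens in enumerate(tokenized_abstracts)]

# Train document vectors (placeholder hyperparameters).
model = Doc2Vec(corpus, vector_size=100, epochs=20, min_count=2)
vectors = np.array([model.dv[i] for i in range(len(corpus))])  # gensim 4.x API

# Reduce each article's vector to a point on the 2D plane.
points = TSNE(n_components=2).fit_transform(vectors)
```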
The problem is that for the majority of search queries, the top 5 articles are not located close to each other. Most likely this can be improved by collecting a larger dataset of articles and tuning `doc2vec`'s hyperparameters.
However, for some queries the results are quite good.
Run `python3 main.py doc2vec` to train the `doc2vec` model.
Finally, you can spin up a server with a web UI by running `python3 main.py web`.
We added a bar of buttons for each retrieved document, so users can mark how relevant each document is.
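As an illustration, collecting those relevance marks could be handled by a small endpoint like the sketch below (assuming a Flask-style server; the route and payload fields are hypothetical):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
relevance_marks = []  # in a real setup these would be persisted

@app.route("/relevance", methods=["POST"])
def mark_relevance():
    # Hypothetical payload: {"query": "...", "doc_id": 42, "relevant": true}
    data = request.get_json()
    relevance_marks.append((data["query"], data["doc_id"], data["relevant"]))
    return jsonify(status="ok")
```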