articlix

Information retrieval project at SPbAU 7th term

Primary language: Python. License: MIT.

Installation

Dev

We use Python and pipenv as the primary development tools. See Pipfile, Pipfile.lock, requirements-dev.txt (if any) and requirements.txt for the full specification of the platform, Python version and dependency packages.
To reproduce the environment, run pip install -r requirements.txt with the matching version of Python. Using a virtualenv is recommended.

Makefile

We provide a Makefile with convenient shortcuts for common commands.
Run make help to list them.

Prerequisites

  • PostgreSQL (psql>=10.0), which the crawler uses to store pages

Usage

We provide a main.py script that implements the command-line interface.
Run python main.py -h to see the available commands and options.

Crawler

python main.py crawler

Index

Once the crawler has stored pages, you can preprocess the data (look at this).
Then build the index by running python main.py --dfpath="data/clean_articles.h5" --indexpath="data/index.json" --workers=8 index.
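For readers new to indexing, here is a minimal sketch of what an index step of this kind typically does: tokenize the documents with a pool of workers, build an inverted index (term to document ids), and serialize it as JSON. It is not the articlix implementation. The function names (tokenize, build_index, search), the tokenization rule and the use of threads are all assumptions made for illustration.

```python
import json
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, workers=2):
    """Map each term to the sorted list of ids of documents containing it.

    `docs` is a dict of {doc_id: text}. Tokenization is spread over a
    thread pool (hypothetical stand-in for a --workers option); the merge
    into a single index happens in the calling thread.
    """
    doc_ids = list(docs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        token_lists = list(pool.map(tokenize, (docs[d] for d in doc_ids)))

    postings = defaultdict(set)
    for doc_id, tokens in zip(doc_ids, token_lists):
        for term in tokens:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


if __name__ == "__main__":
    # Toy corpus, made up for this example.
    docs = {
        1: "Inverted index maps terms to documents",
        2: "A crawler stores pages",
        3: "The index is stored as JSON",
    }
    index = build_index(docs, workers=2)
    print(json.dumps(index, indent=2, sort_keys=True))
    print(search(index, "inverted index"))
```

A real index would usually also store term frequencies or positions for ranking; this sketch keeps only document ids so that the structure stays easy to read.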

Data

Where to find prepared data

Search

Examples

Web interface

Run python main.py web_interface. The page is then available at localhost on port 8080.

License

MIT