- First the crawler runs and saves the documents to disk as pickle files. The crawler is designed to respect the robots.txt file and crawl delays. We separated crawling and indexing because of the time the doc2query model takes to extend each document's tokens.
- When the crawler is finished, the index is built. It reads the documents from disk, stores them in an inverted index, and calculates the TF-IDF and BM25 scores. By default, it extends the document tokens with the doc2query method.
- The user interface is a web frontend built with React.
- The backend API is served by a Django REST server.
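The indexing step described above can be sketched as a minimal inverted index with BM25 ranking. The class and method names are illustrative, not the project's actual API; `k1 = 1.5` and `b = 0.75` are common BM25 defaults, not values taken from this codebase:

```python
import math
from collections import Counter, defaultdict

class InvertedIndex:
    """Toy inverted index: term -> {doc_id: term frequency}."""

    def __init__(self, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.postings = defaultdict(dict)
        self.doc_len = {}

    def add(self, doc_id, tokens):
        # Record document length and per-term frequencies.
        self.doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def bm25(self, query_tokens):
        # Standard Okapi BM25: sum over query terms of
        # idf(t) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl)).
        n = len(self.doc_len)
        avgdl = sum(self.doc_len.values()) / n
        scores = defaultdict(float)
        for term in query_tokens:
            docs = self.postings.get(term, {})
            idf = math.log((n - len(docs) + 0.5) / (len(docs) + 0.5) + 1)
            for doc_id, tf in docs.items():
                norm = self.k1 * (1 - self.b + self.b * self.doc_len[doc_id] / avgdl)
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + norm)
        return sorted(scores.items(), key=lambda x: x[1], reverse=True)

idx = InvertedIndex()
idx.add("doc1", ["web", "search", "engine"])
idx.add("doc2", ["web", "crawler"])
ranked = idx.bm25(["crawler"])  # only doc2 contains "crawler"
```

The real index additionally stores TF-IDF scores and expands each document's token list with doc2query-generated queries before indexing.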
Install the backend dependencies:

```shell
cd backend
pip install -r requirements.txt
```
To use the doc2query implementation, install the required packages by running:

```shell
pip install --upgrade git+https://github.com/terrierteam/pyterrier_doc2query.git
pip install --upgrade git+https://github.com/terrierteam/pyterrier_dr.git
```
To run the crawler, execute:

```shell
cd backend/core
python Crawler.py
```
The crawled documents are stored as pickle files inside `serialization\documents`. The state of the crawler is also stored in this folder as `crawl_state.pickle`. When a crawl-state file already exists, the crawler loads it and resumes crawling from the stored frontier.
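The save/resume behaviour can be sketched with `pickle`. The file location follows the README, but the layout of the state dictionary here is an assumption, not the crawler's actual format:

```python
import pickle
from pathlib import Path

# Location from the README; the dict layout below is illustrative.
STATE_FILE = Path("serialization/documents/crawl_state.pickle")

def save_state(frontier, visited):
    # Persist the frontier (URLs still to crawl) and the visited set.
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    with open(STATE_FILE, "wb") as f:
        pickle.dump({"frontier": frontier, "visited": visited}, f)

def load_state(seed_urls):
    # Resume from the stored frontier if a previous crawl state exists,
    # otherwise start a fresh crawl from the seed URLs.
    if STATE_FILE.exists():
        with open(STATE_FILE, "rb") as f:
            return pickle.load(f)
    return {"frontier": list(seed_urls), "visited": set()}
```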
To index all documents, execute:

```shell
cd backend/core
python DocumentIndex.py
```
The generated index file is stored at `serialization\documents\index.pickle`.
The topic model is needed to rerank documents for diversity. A model trained on previously crawled documents is stored at `serialization\ldamodel.pickle`.
To train a new model on your crawled documents, execute:

```shell
cd backend/core
python LDAmodel.py
```
To start the Django REST server:

```shell
cd backend/SearchEngineServer
python manage.py runserver
```
This automatically loads the index from the `serialization` folder. The server listens on port 8000 to answer query requests.
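A query to the running server can be sketched as below. The host and port come from the README, but the endpoint path and parameter name (`/search`, `query`) are assumptions; the actual route depends on the Django URL configuration:

```python
from urllib.parse import urlencode

def build_query_url(query, host="http://localhost:8000"):
    # Hypothetical endpoint and parameter name; adjust to the
    # routes defined in the Django project's urls.py.
    return f"{host}/search?{urlencode({'query': query})}"

url = build_query_url("open source search engine")
```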
To start the frontend on port 3000:

```shell
cd frontend
npm install
npm start
```
To retrieve results for a batch of queries, execute:

```shell
cd backend/core
python batch_retrieve.py path/to/queryFile.txt path/to/resultList.txt
```
The query file should contain one query number and one query per line, separated by a tab.
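The query-file format described above can be parsed as follows; the function name is illustrative and not part of the project's code:

```python
def parse_query_file(path):
    """Read lines of the form '<query number>\\t<query text>'."""
    queries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            qid, text = line.split("\t", 1)
            queries.append((int(qid), text))
    return queries
```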
To launch the DataCreator Streamlit app:

```shell
cd backend/core
streamlit run DataCreator.py
```