This was implemented using the mrs MapReduce framework.
To run it, execute runWordIndex.py in a terminal. The script automatically searches through all three input files: small.txt, large.txt and very_large.txt.
It also makes use of the stopWords.txt file.
The results are stored in their respective output folders.
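As a rough illustration of what an inverted index with stop-word filtering computes, here is a minimal sketch in plain Python. It does not use the mrs framework and is not the code in runWordIndex.py; the function names (`map_line`, `reduce_postings`, `build_index`) and the small `STOP_WORDS` set are hypothetical stand-ins, and the tokenization rule is an assumption.

```python
import re
from collections import defaultdict

# Hypothetical stop-word subset; the real list is read from stopWords.txt.
STOP_WORDS = {"the", "a", "is", "and", "of"}


def map_line(doc_name, line):
    """Map step: emit (word, doc_name) for every non-stop word in the line."""
    for word in re.findall(r"[a-z']+", line.lower()):
        if word not in STOP_WORDS:
            yield word, doc_name


def reduce_postings(pairs):
    """Reduce step: group document names by word, deduplicated and sorted."""
    index = defaultdict(set)
    for word, doc_name in pairs:
        index[word].add(doc_name)
    return {word: sorted(docs) for word, docs in index.items()}


def build_index(documents):
    """Run the map and reduce steps over a {doc_name: text} mapping."""
    pairs = []
    for doc_name, text in documents.items():
        for line in text.splitlines():
            pairs.extend(map_line(doc_name, line))
    return reduce_postings(pairs)


if __name__ == "__main__":
    docs = {
        "small.txt": "The cat sat",
        "large.txt": "A cat and a dog",
    }
    for word, names in sorted(build_index(docs).items()):
        print(word, names)
```

In a real MapReduce job the framework handles the grouping between the map and reduce steps; here `reduce_postings` does that grouping itself so the sketch stays self-contained.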
These algorithms were implemented using the MrJob Framework.
- The stopwords used for the algorithms in this implementation are located in stopwords_en.txt.
The algorithms for this lab are written in Python 3. The requirements for the virtual environment can be found in requirements.txt.
Please ensure that the following are installed:
- Python 3.6+
- pip3
- virtualenv
- Create a virtual environment:
virtualenv venv
- Activate the virtual environment:
source venv/bin/activate
- Install the requirements for the virtual environment:
pip3 install -r requirements.txt
- To deactivate the virtual environment:
deactivate
- Word Count:
python3 runWordFiles
- Top-K Query:
python3 runTopKFiles
- Inverted Index:
python3 runWordIndex.py
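
For orientation, the word count and top-K query can be sketched in plain Python as below. This is a simplified illustration and not the MrJob code behind the scripts above: `count_words`, `top_k`, the `STOP_WORDS` subset and the tokenization rule are all assumptions made for the example.

```python
import heapq
import re
from collections import Counter

# Hypothetical stop-word subset; the real list is in stopwords_en.txt.
STOP_WORDS = {"the", "a", "is", "and", "of"}


def count_words(lines, stop_words=STOP_WORDS):
    """Count occurrences of each non-stop word across all lines."""
    counts = Counter()
    for line in lines:
        for word in re.findall(r"[a-z']+", line.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts


def top_k(counts, k):
    """Return the k most frequent (word, count) pairs.

    Ties are broken alphabetically so the result is deterministic.
    """
    return heapq.nsmallest(
        k, counts.items(), key=lambda item: (-item[1], item[0])
    )


if __name__ == "__main__":
    lines = ["the cat and the dog", "a cat is a cat"]
    print(top_k(count_words(lines), 2))
```

Using a heap keeps the selection at roughly O(n log k) instead of sorting every distinct word, which matters more as the input grows.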