This is an implementation of single-core and multi-core MapReduce.
The multicore implementation first calculates the length of the file. Because the multiprocessing module cannot pickle generators, we create a set of range objects that represent line indices and pass them to the subprocesses. Each subprocess then iterates lazily through the file and extracts only the lines in its range.
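The range-based splitting described above could be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: the function names (`count_lines`, `make_ranges`, `read_lines`, `count_words_in_range`, `count_words_parallel`) are hypothetical, and it assumes UTF-8 text input and a simple whitespace word count as the map step.

```python
import itertools
from multiprocessing import Pool


def count_lines(path):
    """Count lines by streaming the file once, without loading it into memory."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def make_ranges(total_lines, chunks):
    """Split [0, total_lines) into at most `chunks` contiguous range objects."""
    size = max(1, -(-total_lines // chunks))  # ceiling division
    return [range(start, min(start + size, total_lines))
            for start in range(0, total_lines, size)]


def read_lines(path, line_range):
    """Lazily yield only the lines whose indices fall inside `line_range`."""
    with open(path, encoding="utf-8") as handle:
        yield from itertools.islice(handle, line_range.start, line_range.stop)


def count_words_in_range(task):
    """Worker: takes a picklable (path, range) pair and counts words in that slice."""
    path, line_range = task
    return sum(len(line.split()) for line in read_lines(path, line_range))


def count_words_parallel(path, workers=2):
    """Send one range per task to a process pool and sum the partial counts."""
    tasks = [(path, r) for r in make_ranges(count_lines(path), workers)]
    with Pool(workers) as pool:
        return sum(pool.map(count_words_in_range, tasks))
```

Only the file path and a range object cross the process boundary, and both pickle cleanly. Each worker reopens the file and skips ahead with `itertools.islice`, so no process holds the whole file in memory. Call `count_words_parallel` from under an `if __name__ == "__main__":` guard so it also works on platforms that spawn rather than fork worker processes.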
Steps:
- Check out (clone) this repo with git.
- Create a virtual environment with:
    python3 -m venv env
- Activate the environment:
    source env/bin/activate
- Install the package (this runs setup.py):
    pip install -e .
- Run the scripts. Note the -m flag, which enables multicore mode (default False):
    python word_counter.py data/raw/if-kipling.txt -m True
    python average_ratings.py data/raw/ratings.txt -m True
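How the scripts parse `-m True` is not shown here. As a hedged sketch, a flag that takes an explicit True/False value could be handled with argparse as below. The `str_to_bool` and `build_parser` helpers and the long option name `--multicore` are assumptions made for illustration only.

```python
import argparse


def str_to_bool(value):
    """Convert 'True'/'False'-style text to a bool.

    Using type=bool directly would not work: bool("False") is True,
    because any non-empty string is truthy.
    """
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser():
    """Build a parser with a positional input path and an optional -m flag."""
    parser = argparse.ArgumentParser(description="Hypothetical CLI sketch")
    parser.add_argument("path", help="input text file")
    parser.add_argument("-m", "--multicore", type=str_to_bool, default=False,
                        help="run the multicore implementation (default False)")
    return parser
```

With this sketch, `build_parser().parse_args(["data/raw/ratings.txt", "-m", "True"])` yields `multicore=True`, and omitting `-m` leaves it at the default of `False`.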