2020 Hogeschool Utrecht 2020 Course on Distributed Processing.
Run the full assignment ./setup.sh
in the folder MapReduce
python mapper.py < english.txt | sort -k1,1| python reducer.py english
python mapper.py < dutch.txt | sort -k1,1| python reducer.py dutch
Here you can alter what char's will be converted to a standerd _
The reducer will return and save a matrix of counted letter combinations. Also normalizes the matrix, default can be changed.
python main.py validation.txt
This python files ties everything together, just in a way to show it can be done without unreadable commands. This file will print a prediction per result and a final score per language.
Normalizing the data really helped in filtering out actual patterns, here you can see the mean and the standard deviation. This will be printed to the shell when running the setup/main.
english: mu:123.70, std:339.93
dutch: mu:127.85, std:359.16
The final line will be the results. Here you can see that there is a high accuracy. The actual 73 dutch and 119 (including newlines and sources lines...) isn't that far of from my predicted data. {'english_matrix': 116, 'dutch_matrix': 75}