
Primary language: Python. License: Boost Software License 1.0 (BSL-1.0).

Milestone3

In the third milestone, we compile relational algebra queries into a physical query plan of MapReduce jobs.
The MapReduce jobs can then be executed directly on Hadoop through the Python luigi module, a workflow engine that can run MapReduce jobs locally or on Hadoop (among many other things). We use the following command lines to execute a task:

Locally

python3.6 ra2mr.py SelectTask --querystring "\select_{gender='female'} Person;" --exec-environment LOCAL --local-scheduler

On Hadoop

PYTHONPATH=. luigi --module ra2mr SelectTask --querystring "\select_{gender='female'} Person;" --exec-environment HDFS --local-scheduler
To run the tests locally, we use the pytest module: pytest test_e2e.py or pytest ra2mr.py.
The unit tests set the task parameter exec_environment to MOCK, so all files are kept in main memory only. This mode is intended for unit testing.
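To make the idea behind the SelectTask commands above more concrete, here is a minimal sketch of how a selection such as \select_{gender='female'} Person can be expressed as a map-only step. It is an illustration, not the code in ra2mr.py: the function name select_mapper and the record format (relation name, a tab, then the tuple as JSON) are assumptions made for this example.

```python
import json


def select_mapper(line, attribute, value):
    """Map step of a selection: emit the input tuple only if it matches."""
    # Assumed record format: relation name, a tab, then the tuple as JSON.
    relation, payload = line.rstrip("\n").split("\t", 1)
    record = json.loads(payload)
    if record.get(attribute) == value:
        yield relation, record


# Two hypothetical input lines of the Person relation.
lines = [
    'Person\t{"name": "Amy", "gender": "female"}',
    'Person\t{"name": "Ben", "gender": "male"}',
]

# A selection needs no reduce step: the mapper output is already the result.
selected = [pair for line in lines for pair in select_mapper(line, "gender", "female")]
print(selected)
```

Because every tuple is tested independently of the others, a selection needs no shuffle or reduce phase, which is why it can run as a map-only job.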

Fixing the VirtualBox problem for Linux users

The problem is that the VirtualBox kernel module is not signed and is therefore not loaded by the kernel.
This happens when Secure Boot is enabled on your computer, which is very common on modern hardware. As a result, opening any machine in VirtualBox fails with the error: Kernel driver not installed (rc=-1908).
Follow these steps to sign the driver so that it can be loaded as a kernel module, on Ubuntu and also on Debian 9:

Install the mokutil package, which is needed to register the signing key:

sudo apt-get update
sudo apt-get upgrade
sudo apt-get install mokutil

Generate the signing key and certificate:

openssl req -new -x509 -newkey rsa:2048 -keyout MOK.priv -outform DER -out MOK.der -nodes -days 36500 -subj "/CN=VirtualBox/"

Sign the vboxdrv kernel module with it:

sudo /usr/src/linux-headers-$(uname -r)/scripts/sign-file sha256 ./MOK.priv ./MOK.der $(modinfo -n vboxdrv)

Register it for Secure Boot.

IMPORTANT! This command asks you to set a password. Choose any password you like; you will only need it once, during the next reboot.
sudo mokutil --import MOK.der

Finally, restart the computer.

In the MOK management screen shown at boot, choose Enroll MOK -> Continue, enter the password when asked, and you are done.

Steps for setting up the Cloudera VM:

1- Download the Cloudera VM: Link >>
2- Change the keyboard layout to match your keyboard, for example: setxkbmap fr
To do this automatically every time, append the same command to your .bashrc: echo "setxkbmap fr" >> ~/.bashrc
3- Mount a shared folder so that we can easily share data between the host and the virtual machine: Link >>
4- Open a terminal. We'll need a more modern Python version and some extra modules.

Download Python 3.6:

wget https://www.python.org/ftp/python/3.6.5/Python-3.6.5.tar.xz
xz -d Python-3.6.5.tar.xz
tar -xvf Python-3.6.5.tar
cd Python-3.6.5
./configure --prefix=/usr/local

Build (compile) the source. This can take a while:

make
sudo make altinstall
cd ..
sudo rm -rf Python*

Install pip:

wget https://bootstrap.pypa.io/get-pip.py
sudo /usr/local/bin/python3.6 get-pip.py
rm -f get-pip.py

Install further modules that we will need:

sudo /usr/local/bin/python3.6 -m pip install luigi
sudo /usr/local/bin/python3.6 -m pip install sqlparse
sudo /usr/local/bin/python3.6 -m pip install radb
sudo /usr/local/bin/python3.6 -m pip install pytest
sudo /usr/local/bin/python3.6 -m pip install pytest-repeat
sudo /usr/local/bin/python3.6 -m pip uninstall -y antlr4-python3-runtime
sudo /usr/local/bin/python3.6 -m pip install antlr4-python3-runtime==4.7
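After the installation, a quick check like the following can confirm that the modules are visible to the new interpreter. This is only a sketch: the helper name missing_modules is made up for this example, and the import names (for instance pytest_repeat for pytest-repeat and antlr4 for antlr4-python3-runtime) are assumed to be the names under which these packages install.

```python
import importlib.util

# Import names assumed for the packages installed above.
REQUIRED = ["luigi", "sqlparse", "radb", "pytest", "pytest_repeat", "antlr4"]


def missing_modules(names):
    """Return the names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means every required module was found.
print(missing_modules(REQUIRED))
```

Run it with /usr/local/bin/python3.6 so that the check uses the same interpreter the modules were installed for.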

Useful links:

Python yield usage and concept: Link >>
To understand the concept of MapReduce well, see the section on “Workflow Systems” for MapReduce engines (section 2.4.1) of the book “Mining of Massive Datasets”.
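The two topics above meet in practice: mappers and reducers are commonly written as Python generators that yield key-value pairs. The following is a small in-memory sketch of the map, shuffle, and reduce phases, not how luigi or Hadoop implement them. The names mapper, reducer, and run_mapreduce and the sample data are assumptions made for this example.

```python
from collections import defaultdict


def mapper(record):
    # Emit one (key, value) pair per input record; here the key is the gender.
    yield record["gender"], 1


def reducer(key, values):
    # Receives every value emitted for one key and emits a single total.
    yield key, sum(values)


def run_mapreduce(records, mapper, reducer):
    # Map and shuffle phases: group all mapper outputs by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: call the reducer once per key, in sorted key order.
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results


# Hypothetical sample tuples of a Person relation.
people = [
    {"name": "Amy", "gender": "female"},
    {"name": "Ben", "gender": "male"},
    {"name": "Cleo", "gender": "female"},
]
counts = run_mapreduce(people, mapper, reducer)
print(counts)
```

Using yield lets a mapper or reducer emit zero, one, or many pairs per call without building intermediate lists, which keeps the same function shape for filters, counts, and joins.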