In the third milestone, we compile relational algebra queries into a physical query plan of
MapReduce jobs.
The MapReduce jobs can then be executed directly on Hadoop by means of the
Python luigi module, a workflow engine that can execute MapReduce jobs locally or
on Hadoop (among many other things).
We will use the following command lines to evaluate/execute a task:
python3.6 ra2mr.py SelectTask --querystring "\select_{gender='female'} Person;" --exec-environment LOCAL --local-scheduler
PYTHONPATH=. luigi --module ra2mr SelectTask --querystring "\select_{gender='female'} Person;" --exec-environment HDFS --local-scheduler
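As a rough illustration of what a selection task might compute, the sketch below applies the predicate gender='female' inside a mapper. The record layout (relation name as key, JSON-encoded tuple as value) and the names select_mapper and run_map_only are assumptions made for this example; they are not taken from ra2mr.py.

```python
import json


def select_mapper(key, value, predicate):
    """Emit the (key, value) pair unchanged if the tuple satisfies the predicate."""
    row = json.loads(value)
    if predicate(row):
        yield key, value


def run_map_only(records, mapper, **kwargs):
    """Run a mapper over all input records; a selection needs no reduce phase."""
    for key, value in records:
        yield from mapper(key, value, **kwargs)


# Hypothetical input: one (relation name, JSON-encoded tuple) pair per record.
person = [
    ("Person", json.dumps({"Person.name": "Amy", "Person.gender": "female"})),
    ("Person", json.dumps({"Person.name": "Ben", "Person.gender": "male"})),
]


def is_female(row):
    return row["Person.gender"] == "female"


selected = list(run_map_only(person, select_mapper, predicate=is_female))
for key, value in selected:
    print(key, value)
```

A selection filters tuples one at a time, so in this sketch the mapper alone does all the work and nothing has to be grouped by key.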
To execute the tests locally, we use the pytest module by running the command pytest test_e2e.py or
pytest ra2mr.py.
The unit tests set the task parameter exec_environment to MOCK. All files are then
kept in main memory only. This is intended for unit testing.
When opening any machine in VirtualBox, you may get the error: Kernel driver not installed (rc=-1908)
The problem is that the VirtualBox kernel module (vboxdrv) is not signed and is therefore not loaded into the kernel.
This happens if your computer has Secure Boot enabled, which is very common on modern machines.
Follow these steps to sign the driver so that it can be loaded as a kernel module, on Ubuntu systems and also on Debian 9:
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install mokutil
openssl req -new -x509 -newkey rsa:2048 -keyout MOK.priv -outform DER -out MOK.der -nodes -days 36500 -subj "/CN=VirtualBox/"
sudo /usr/src/linux-headers-$(uname -r)/scripts/sign-file sha256 ./MOK.priv ./MOK.der $(modinfo -n vboxdrv)
IMPORTANT! The next command will ask you for a password. Choose any password you like; you will only need it once, at the next reboot.
sudo mokutil --import MOK.der
Reboot. In the MOK management screen, choose Enroll MOK -> Continue; it will then ask you for the password, and you are done.
1- Download the Cloudera VM: Link >>
2- Change the keyboard layout by running the command: setxkbmap fr (replace fr with your own layout code)
To do this automatically every time, extend your .bashrc by running the command: echo "setxkbmap fr" >> ~/.bashrc
3- Mount a shared folder so that we can easily share data between the host, and the virtual machine:
Link >>
4- Open a terminal. We’ll need a more modern Python version and some extra modules.
wget https://www.python.org/ftp/python/3.6.5/Python-3.6.5.tar.xz
xz -d Python-3.6.5.tar.xz
tar -xvf Python-3.6.5.tar
cd Python-3.6.5
./configure --prefix=/usr/local
make
sudo make altinstall
cd ..
sudo rm -rf Python*
wget https://bootstrap.pypa.io/get-pip.py
sudo /usr/local/bin/python3.6 get-pip.py
rm -f get-pip.py
sudo /usr/local/bin/python3.6 -m pip install luigi
sudo /usr/local/bin/python3.6 -m pip install sqlparse
sudo /usr/local/bin/python3.6 -m pip install radb
sudo /usr/local/bin/python3.6 -m pip install pytest
sudo /usr/local/bin/python3.6 -m pip install pytest-repeat
sudo /usr/local/bin/python3.6 -m pip uninstall -y antlr4-python3-runtime
sudo /usr/local/bin/python3.6 -m pip install antlr4-python3-runtime==4.7
Python yield usage and concept:
Link >>
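A minimal sketch of how yield behaves: a function containing yield returns a generator, which produces its values one at a time and suspends between them. The function names below are chosen for illustration only. The second function mirrors the way a MapReduce mapper typically emits key-value pairs.

```python
def count_up_to(limit):
    """Generator: yields 1..limit one value at a time, pausing between values."""
    n = 1
    while n <= limit:
        yield n  # hand n to the caller and suspend here
        n += 1


def word_mapper(line):
    """Mapper-style generator: emit one (word, 1) pair per word in the line."""
    for word in line.split():
        yield word, 1


gen = count_up_to(3)
first = next(gen)  # runs the body up to the first yield
rest = list(gen)   # resumes after the yield until the generator is exhausted
print(first, rest)  # prints: 1 [2, 3]

pairs = list(word_mapper("a b a"))
print(pairs)  # prints: [('a', 1), ('b', 1), ('a', 1)]
```

Nothing in the generator body runs until the first value is requested, so a mapper written this way never has to hold all of its output in memory at once.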
To understand the concept of MapReduce well, have a look at the section on “Workflow Systems” for MapReduce engines, Section 2.4.1 of the
book “Mining of Massive Datasets”.