BIG DATA Pi

This project is inspired by Raúl Marín's (LinkedIn | GitHub) project Open Source Big Data Toolbelt (OSBDET) and provides an educational platform for Big Data technologies on the Raspberry Pi (RPi).

The newest version of the small single-board computer, the Raspberry Pi 4 B, is equipped with a faster processor and more RAM than its predecessors. This allows users to run more demanding applications while still benefiting from the energy efficiency of the ARM architecture. The RPi draws only 6.4 W under full load, roughly a tenth of the power of a traditional incandescent light bulb.

Technologies

This Ansible playbook installs and configures the following technologies:

Archiconda3 (conda for 64-bit ARM)
Apache Kafka (message broker & storage)
Apache Spark (stream processing)
Apache Superset (visualization)
Jupyter Lab (next gen coding interface)
MariaDB (storage)
Samba (network file service)

HOW TO USE IT

Prerequisites:

a Raspberry Pi 4/400 (4 GB+ RAM)
a USB-C power supply
a boot drive (SD card or USB 3 stick)
admin access to your router
recommended: a LAN cable to connect the RPi to the router

Initial Setup:

install Ubuntu 21.04 Server (64-bit) on your RPi (link to tutorial)
SSH into the RPi and put your SSH key on it (link to tutorial)
install Ansible on your laptop/desktop/control machine (via pip | official documentation)

Run the Ansible Playbook

download this repository to your control machine with git clone https://github.com/Maximilian-Pichler/BDPi
open the hosts.ini file and put the IP address of your RPi in the second line.
if you use additional storage, add the UUID and the format type of your drive there too; otherwise leave these strings empty.
if you don't want to install the E2E demo, open playbook.yml and delete the line - role: demo
execute ansible-playbook playbook.yml from the BDPi repository folder on your control machine.
get something to drink... or eat. This will take a while.*
* The time it takes to run the playbook depends heavily on your boot drive. A USB 3 stick is the preferred choice and reduces the time needed to approx. 45 min.
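
The optional step of removing the demo role can also be scripted instead of edited by hand. The following is a minimal sketch, not part of the repository: the function name drop_demo_role and the sample playbook content (including the role name base) are hypothetical and only illustrate the idea. It simply filters out the line - role: demo and leaves everything else untouched.

```python
def drop_demo_role(playbook_text: str, role: str = "demo") -> str:
    """Return the playbook text without the '- role: <role>' line."""
    target = f"- role: {role}"
    kept = [
        line
        for line in playbook_text.splitlines(keepends=True)
        if line.strip() != target  # compare without indentation or newline
    ]
    return "".join(kept)


# Hypothetical playbook content, for illustration only; the real
# playbook.yml in the repository may look different.
SAMPLE = """\
- hosts: all
  roles:
    - role: base
    - role: demo
"""

if __name__ == "__main__":
    # Print the sample with the demo role removed.
    print(drop_demo_role(SAMPLE), end="")
```

To apply this to the real file, one could read playbook.yml with pathlib.Path("playbook.yml").read_text(), pass the text through drop_demo_role, and write the result back. Line filtering is fragile compared to a YAML parser, so check the result before running the playbook.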

Have Fun

Once the installation is finished, the services are listening on the following ports:

Service	Port	Password / Token	User
SparkHub*	4040
Jupyter Lab	8881	abcd
MariaDB	3306	abcd	ubuntu
Superset	8088	abcd	ubuntu
Network Storage	445	abcd	ubuntu

* only with an active Spark job

From a security standpoint this configuration is not ideal, but it makes the services easier to access.
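
To confirm that the services are actually reachable after the playbook has finished, a quick TCP check from the control machine can help. This is a minimal sketch using only the Python standard library; the port numbers come from the table above, while the function names is_listening and check_services are hypothetical and the host must be replaced with the IP address of your RPi. It only tests whether a port accepts connections, not whether the service behind it works or the password is correct.

```python
import socket
import sys

# Ports from the table above. SparkHub only listens while a Spark job runs.
SERVICES = {
    "SparkHub": 4040,
    "Jupyter Lab": 8881,
    "MariaDB": 3306,
    "Superset": 8088,
    "Network Storage": 445,
}


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_services(host: str, services: dict = SERVICES) -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, port in services.items()}


if __name__ == "__main__":
    # Pass the RPi address as the first argument; 127.0.0.1 is only a fallback.
    rpi_host = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    for name, up in check_services(rpi_host).items():
        print(f"{name:16} {'listening' if up else 'not reachable'}")
```

Saved under a name of your choice and run with the address from hosts.ini as its argument, the script prints one line per service.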

WHY ANSIBLE?

Because I wanted to learn it after reading about its capabilities. (Deploy a Kafka Cluster with Ansible | short introduction video)
Because I mess things up and automation makes starting from scratch less painful.
Because I'm lazy. Ansible playbooks are easy to read and understand, and they reduce the need for documentation.

Demo

If you installed the demo, you can find instructions here.

UPCOMING FEATURES (maybe)

Continuous Integration Workflow on self-hosted GitLab
Spark Cluster Integration
VPN for remote access
