This project is inspired by Raúl Marín's (LinkedIn | GitHub) project Open Source Big Data Toolbelt (OSBDET) and provides an educational platform for Big Data technologies on the Raspberry Pi (RPi).
The newest version of the small single-board computer, the Raspberry Pi 4 B, is equipped with a faster processor and more RAM than its predecessors. This allows users to run more demanding applications while still benefiting from the energy efficiency of the ARM architecture. The RPi uses only 6.4 W under full load, roughly a tenth of what a traditional 60 W incandescent light bulb draws.
This Ansible playbook installs and configures the following technologies:
- Archiconda3 (conda on 64-bit ARM)
- Apache Kafka (message broker & storage)
- Apache Spark (stream processing)
- Apache Superset (visualization)
- Jupyter Lab (next gen coding interface)
- MariaDB (storage)
- Samba (network file service)
- a Raspberry Pi 4/400 (4 GB+)
- a USB-C power supply
- a boot drive (SD card or USB 3 stick)
- admin access to your router
- recommended: a LAN cable to connect to the router
- install Ubuntu 21.04 Server (64-bit) on your RPi (link to tutorial)
- SSH into the RPi and put your SSH key on it (link to tutorial)
- install Ansible on your laptop/desktop/control machine (via pip | official documentation)
- download this repository to your control machine with `git clone https://github.com/Maximilian-Pichler/BDPi`
- open the `hosts.ini` file and put the IP address of your RPi in the second line.
- if you use additional storage, add the UUID and the format type of your drive here too, otherwise leave these strings empty.
- if you don't want to install the E2E demo, open `playbook.yml` and delete the line `- role: demo`
- execute `ansible-playbook playbook.yml` from the BDPi repository folder on your control machine.
- get something to drink... or eat. This will take a while.*
*The time it takes to run the playbook depends heavily on your boot drive. A USB 3 stick is the preferred choice and reduces the time needed to approx. 45 min.
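If you would rather script the "skip the demo" step than edit the file by hand, something like the following could work. It is only a sketch: the function name `remove_demo_role` is made up for this example, and it assumes the role is listed on its own line as `- role: demo` in `playbook.yml`.

```shell
#!/usr/bin/env bash
# Sketch only, not part of the BDPi repository.
# Assumption: the demo role appears on its own line as "- role: demo".

# Remove the demo role line from the given playbook file (hypothetical helper).
remove_demo_role() {
  local playbook="$1"
  local tmp
  tmp="$(mktemp)"
  # Copy every line except the one that enables the demo role.
  grep -v -E '^[[:space:]]*-[[:space:]]*role:[[:space:]]*demo[[:space:]]*$' "$playbook" > "$tmp"
  # Write the result back in place so the file keeps its permissions.
  cat "$tmp" > "$playbook"
  rm -f "$tmp"
}
```

For example, run `remove_demo_role playbook.yml` inside the repository folder and then start the installation with `ansible-playbook playbook.yml` as described above.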
Once the installation is finished, the services are listening on the following ports:
Service | Port | Password / Token | User |
---|---|---|---|
SparkHub* | 4040 | | |
Jupyter Lab | 8881 | abcd | |
MariaDB | 3306 | abcd | ubuntu |
Superset | 8088 | abcd | ubuntu |
Network Storage | 445 | abcd | ubuntu |
*only with an active Spark-Job
From a security standpoint this configuration is not ideal, since every service shares the same default password, but it makes things easier to access.
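To verify from your control machine that the services are actually reachable, a quick port probe along these lines may help. This is a rough sketch and not part of the playbook: the function names `port_open` and `check_services` and the short service labels are invented here, and it relies on bash's `/dev/tcp` redirection and the GNU `timeout` command being available.

```shell
#!/usr/bin/env bash
# Sketch only. Ports are taken from the table above; the labels are shorthand.
SERVICES="SparkHub:4040 JupyterLab:8881 MariaDB:3306 Superset:8088 Samba:445"

# Return 0 if a TCP connection to host:port succeeds within 2 seconds.
# Assumption: bash and GNU timeout are installed on the control machine.
port_open() {
  local host="$1" port="$2"
  timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Print one "<service> <port> open|closed" line per service for the given host.
check_services() {
  local host="$1" entry name port
  for entry in $SERVICES; do
    name="${entry%%:*}"
    port="${entry##*:}"
    if port_open "$host" "$port"; then
      echo "$name $port open"
    else
      echo "$name $port closed"
    fi
  done
}
```

Call it as `check_services <IP of your RPi>`. Port 4040 is expected to show up as closed unless a Spark job is currently running.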
- Because I wanted to learn it after reading about its capabilities. (Deploy a Kafka Cluster with Ansible | short introduction video)
- Because I mess things up and automation makes starting from scratch less painful.
- Because I'm lazy. Ansible playbooks are easy to read and understand, and they reduce the need for documentation.
If you installed the demo you can find instructions here.
- Continuous Integration Workflow on self-hosted GitLab
- Spark Cluster Integration
- VPN for remote access