E2E Big Data technologies on a Raspberry Pi

Primary language: Python | License: Apache License 2.0

BIG DATA Pi

This project is inspired by Raúl Marín's (LinkedIn | GitHub) project Open Source Big Data Toolbelt (OSBDET) and provides an educational platform for Big Data technologies on the Raspberry Pi (RPi).

The newest version of the small single-board computer, the Raspberry Pi 4 B, is equipped with a faster processor and more RAM than its predecessors. This allows users to run more demanding applications while still benefiting from the energy efficiency of the ARM architecture. The RPi draws only 6.4 W under full load, which is roughly one-tenth of what a traditional incandescent light bulb uses.
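To put the power figure in perspective, here is a minimal back-of-the-envelope sketch. The 6.4 W value comes from the paragraph above; the 60 W bulb rating and the 24/7 full-load duty cycle are assumptions chosen for illustration, not measurements.

```python
# Rough energy comparison. Only the 6.4 W figure is taken from the text above.
RPI_FULL_LOAD_W = 6.4   # full-load draw of the RPi quoted in this README
BULB_W = 60.0           # ASSUMPTION: wattage of a traditional incandescent bulb


def yearly_kwh(watts: float, hours_per_day: float = 24.0) -> float:
    """Energy in kWh consumed over 365 days at a constant power draw."""
    return watts * hours_per_day * 365 / 1000


# Fraction of the bulb's power that the RPi draws at full load.
ratio = RPI_FULL_LOAD_W / BULB_W

if __name__ == "__main__":
    print(f"RPi / bulb power ratio: {ratio:.2f}")
    print(f"RPi at full load, 24/7: {yearly_kwh(RPI_FULL_LOAD_W):.1f} kWh per year")
    print(f"Bulb, 24/7:             {yearly_kwh(BULB_W):.1f} kWh per year")
```

In practice the RPi will idle most of the time, so the real yearly figure should be lower than this upper bound.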


Technologies

This Ansible-Playbook installs and configures the following technologies:


HOW TO USE IT

Prerequisites:

  • a Raspberry Pi 4/400 (4 GB+ RAM)
  • a USB-C power supply
  • a boot drive (SD card or USB 3 stick)
  • admin access to your router
  • recommended: a LAN cable to connect to the router

Initial Setup:

Run the Ansible Playbook

  • download this repository to your control machine with git clone https://github.com/Maximilian-Pichler/BDPi
  • open the hosts.ini file and put the IP address of your RPi in the second line.
  • if you use additional storage, add the UUID and the format type of your drive there too; otherwise leave these strings empty.
  • if you don't want to install the E2E demo, open playbook.yml and delete the line - role: demo
  • execute ansible-playbook playbook.yml from the BDPi repository folder on your control machine.
  • get something to drink... or eat. This will take a while.
    • The time it takes to run the playbook depends heavily on your boot drive. A USB 3 stick is the preferred choice and reduces the time needed to approx. 45 min.
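Because the playbook run is long, it can be worth checking the basics on the control machine before starting. The following is a minimal pre-flight sketch, not part of this repository: the function names are hypothetical, and it assumes the RPi accepts SSH connections on the default port 22 (Ansible connects over SSH by default). It does not read or modify hosts.ini; you pass the same IP address you entered there.

```python
import ipaddress
import shutil
import socket
import sys


def is_valid_ip(text: str) -> bool:
    """Return True if text parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text.strip())
        return True
    except ValueError:
        return False


def ssh_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    ASSUMPTION: the RPi listens for SSH on the default port 22.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(host: str) -> list:
    """Collect human-readable problems; an empty list means ready to run."""
    problems = []
    # Both tools are used in the steps above and must be on PATH.
    for tool in ("git", "ansible-playbook"):
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    if not is_valid_ip(host):
        problems.append(f"{host!r} is not a valid IP address")
    elif not ssh_reachable(host):
        problems.append(f"cannot reach {host} on the SSH port")
    return problems


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: python preflight.py <ip-address-of-your-rpi>")
    else:
        issues = preflight(sys.argv[1])
        print("\n".join(issues) if issues else "Looks good, run the playbook.")
```

Saved under a name of your choice (preflight.py here is only a placeholder), it would be called with the RPi's address as the single argument.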

Have Fun

Once the installation is finished, the services are listening on the following ports:

Service          Port  Password / Token  User
SparkHub*        4040
Jupyter Lab      8881  abcd
MariaDB          3306  abcd              ubuntu
Superset         8088  abcd              ubuntu
Network Storage  445   abcd              ubuntu

*only with an active Spark-Job

From a security standpoint this configuration is not ideal, but it makes the services easier to access.
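To see which of the services listed above are actually up, a small TCP check from the control machine is often enough. This is a sketch under stated assumptions: the port numbers are copied from the table, the function names are hypothetical, and the IP address in the usage section is a placeholder you must replace with your RPi's address. A successful connection only shows that something is listening on the port, not that the service is healthy.

```python
import socket

# Ports copied from the table above.
SERVICES = {
    "SparkHub": 4040,        # only open while a Spark job is running
    "Jupyter Lab": 8881,
    "MariaDB": 3306,
    "Superset": 8088,
    "Network Storage": 445,
}


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(host: str, services: dict = SERVICES, timeout: float = 2.0) -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: port_open(host, port, timeout) for name, port in services.items()}


if __name__ == "__main__":
    RPI_HOST = "192.168.1.50"  # PLACEHOLDER: replace with your RPi's IP address
    for name, is_up in check_services(RPI_HOST).items():
        print(f"{name:<16} {'listening' if is_up else 'not reachable'}")
```

SparkHub showing as not reachable is expected whenever no Spark job is active.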


WHY ANSIBLE?

  • Because I wanted to learn it after reading about its capabilities. (Deploy a Kafka Cluster with Ansible | short introduction video)
  • Because I mess things up and automation makes starting from scratch less painful.
  • Because I'm lazy. Ansible playbooks are easy to read and understand, and they reduce the need for documentation.

Demo

If you installed the demo you can find instructions here.


UPCOMING FEATURES (maybe)

  • Continuous Integration workflow on self-hosted GitLab
  • Spark Cluster Integration
  • VPN for remote access