Big Data Project

This project illustrates an approach to building an end-to-end pipeline for a big data project, intended for academic purposes.

Requirements

To start things off, you need to have the following installed (these are the tools the commands below rely on):

  • Docker and Docker Compose
  • GNU Make
  • Task (go-task)
  • Node.js and Yarn
  • Git, with the submodules initialized

Getting started

Start the services

The steps below are only an overview; the full documentation can be found in the submodules themselves.

  • Start a Hadoop cluster (a quick health check is sketched after this list)
(cd docker-hadoop && make build && make up)
  • Start Spark and Cassandra
(cd cassandra-spark-docker && task up)
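
If you want to confirm the Hadoop cluster actually came up, one option is to poll the NameNode web UI. Below is a minimal sketch in Python, assuming the common docker-hadoop port mapping of 9870 on localhost; adjust the port if your compose file maps it differently.

# Quick health check against the NameNode web UI (the port is an assumption)
import requests

resp = requests.get("http://localhost:9870", timeout=5)
print("NameNode UI status:", resp.status_code)  # 200 means the UI is reachable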

Create the output tables

Open a shell directly into Cassandra

(cd cassandra-spark-docker && task cqlsh)

Create the cleaned_data keyspace and the sales_analytics table

CREATE KEYSPACE cleaned_data WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE cleaned_data;
CREATE TABLE sales_analytics (
    store text PRIMARY KEY,
    total_sales double
);
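
-- Optional sanity check: insert one hypothetical row by hand
-- (the real rows are written by the Spark job in a later step)
INSERT INTO sales_analytics (store, total_sales) VALUES ('store_001', 1234.56);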

-- Inspect the schema and current contents
DESCRIBE TABLE sales_analytics;
SELECT * FROM sales_analytics;

-- Truncate the table
TRUNCATE sales_analytics;

Run the PySpark process

(cd cassandra-spark-docker/examples && task build-python && task run-python)
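
For reference, here is a minimal sketch of what such a PySpark job might look like. The actual script lives in cassandra-spark-docker/examples; the input path, column names, and connection host below are assumptions, not the project's real values.

# Hypothetical outline of the aggregation job (not the project's actual script)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("sales-analytics")
    # assumes the spark-cassandra-connector is on the classpath
    .config("spark.cassandra.connection.host", "cassandra")
    .getOrCreate()
)

# Read raw sales data from HDFS (hypothetical path and schema)
sales = spark.read.csv("hdfs://namenode:9000/data/sales.csv",
                       header=True, inferSchema=True)

# Aggregate total sales per store
totals = sales.groupBy("store").agg(F.sum("sale_amount").alias("total_sales"))

# Write the result into the Cassandra table created above
(
    totals.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="cleaned_data", table="sales_analytics")
    .mode("append")
    .save()
)

spark.stop()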

You can verify that the table was populated by running the SELECT statement above in cqlsh.
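
Alternatively, you can query the table programmatically with the DataStax cassandra-driver package (pip install cassandra-driver); a sketch assuming Cassandra's port 9042 is mapped to localhost:

# Read back the aggregated rows without opening cqlsh
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # host is an assumption; adjust to your setup
session = cluster.connect("cleaned_data")
for row in session.execute("SELECT store, total_sales FROM sales_analytics"):
    print(row.store, row.total_sales)
cluster.shutdown()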

Explore the data

Copy the .env.example file to .env.local

(cd my-explorer && cp .env.example .env.local)

Install the dependencies

(cd my-explorer && yarn install)

Run the Next.js application

(cd my-explorer && yarn dev)

Resources