This project illustrates one approach to building an end-to-end pipeline for a big data project, for academic purposes.
To start things off, you need to have the following set up and installed:
The steps below are just an overview, and the full documentation can be found in the submodules themselves.
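Before running the steps, it can help to confirm that the command-line tools they call are on your PATH. The check below is a minimal sketch: the tool list (docker, make, task, yarn) is an assumption inferred from the commands and directory names in this README, not an official requirements list, so adjust it to match the submodules' documentation.

```shell
#!/bin/sh
# Preflight sketch: report which command-line tools are missing from PATH.

# Print the name of every given tool that is not found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        # `command -v` is the POSIX way to test whether a command exists.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed tool list, inferred from the commands used in this README.
missing=$(missing_tools docker make task yarn)

if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing" >&2
else
    printf 'All required tools found.\n'
fi
```

The script only reports what is missing; it does not install anything or check versions.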
- Start a hadoop cluster
(cd docker-hadoop && make build && make up)
- Start spark and cassandra
(cd cassandra-spark-docker && task up)
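Cassandra typically needs some time after its container starts before it accepts CQL connections, so a step run immediately after startup may fail. One way to handle this is a small retry helper, sketched below. The `retry` function is a hypothetical helper and not part of this project or its submodules.

```shell
# Run a command until it succeeds, up to a maximum number of attempts.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    # Re-run the command until it exits with status 0.
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Gave up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

For example, `retry 30 2 nc -z localhost 9042` would poll Cassandra's default native transport port every two seconds. This assumes that the container publishes port 9042 on localhost and that an `nc` supporting `-z` is installed, neither of which is confirmed by this README.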
Open a shell directly to cassandra
(cd cassandra-spark-docker && task cqlsh)
Create the cleaned_data keyspace and the sales_analytics table
CREATE KEYSPACE cleaned_data WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE cleaned_data;
CREATE TABLE sales_analytics (
store text PRIMARY KEY,
total_sales double
);
DESCRIBE TABLE sales_analytics;
SELECT * FROM sales_analytics;
-- Truncate the table
TRUNCATE sales_analytics;
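The CREATE statements above raise an error if the keyspace or table already exists, which matters when the setup is repeated. A possible variant, sketched below, writes the same schema with `IF NOT EXISTS` to a file so it can be applied more than once. The file name `schema.cql` and the `SCHEMA_FILE` variable are hypothetical and not part of the project.

```shell
# Write an idempotent version of the schema to a file (hypothetical name).
SCHEMA_FILE="${SCHEMA_FILE:-schema.cql}"

# The quoted 'EOF' delimiter keeps the shell from expanding anything in the CQL.
cat > "$SCHEMA_FILE" <<'EOF'
CREATE KEYSPACE IF NOT EXISTS cleaned_data
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS cleaned_data.sales_analytics (
  store text PRIMARY KEY,
  total_sales double
);
EOF

echo "Wrote $SCHEMA_FILE"
```

cqlsh can execute a file with its `-f` option (for example `cqlsh -f schema.cql`), but how to pass a file through the `task cqlsh` wrapper is not covered here, so check the submodule's documentation.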
Build and run the Python example to populate the table
(cd cassandra-spark-docker/examples && task build-python && task run-python)
You can verify the result by running the SELECT command above in cqlsh to see the populated table.
Copy .env.example to .env.local
(cd my-explorer && cp .env.example .env.local)
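Running the copy again overwrites any values already edited in .env.local. If that is a concern, a guarded copy along the following lines may help. The `copy_env_if_missing` function is a hypothetical helper written for illustration, not something the project provides.

```shell
# Copy .env.example to .env.local inside the given directory,
# unless .env.local already exists (so local edits are not overwritten).
copy_env_if_missing() {
    dir=$1
    if [ -f "$dir/.env.local" ]; then
        echo "$dir/.env.local already exists, leaving it untouched"
    elif [ -f "$dir/.env.example" ]; then
        cp "$dir/.env.example" "$dir/.env.local"
        echo "Created $dir/.env.local"
    else
        echo "No .env.example found in $dir" >&2
        return 1
    fi
}
```

From the repository root, it would be called as `copy_env_if_missing my-explorer`.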
Install the dependencies
(cd my-explorer && yarn install)
Run the Next.js application
(cd my-explorer && yarn dev)