An attempt to stream multiple subreddits from the Reddit API using Kafka & Spark and store the results in an S3 data lake as Delta tables. Nightly Glue jobs process the raw data into clean data, partitioned by year/month/day, for querying in Athena.
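As a rough illustration of the year/month/day partitioning, the sketch below derives a partition prefix from a Unix timestamp such as Reddit's created_utc field. The function name `partition_prefix`, the Hive-style `key=value` naming, and the zero-padding are assumptions for illustration; the actual Glue jobs may lay out partitions differently.

```python
from datetime import datetime, timezone


def partition_prefix(ts: float) -> str:
    """Build a year/month/day partition prefix from a Unix timestamp (UTC).

    Hypothetical helper: the key=value naming and zero-padding are assumptions,
    not taken from the project's Glue jobs.
    """
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"year={dt.year:04d}/month={dt.month:02d}/day={dt.day:02d}"


if __name__ == "__main__":
    # 1655251200 is 2022-06-15 00:00:00 UTC
    print(partition_prefix(1655251200))
```

Deriving the date in UTC keeps a post in the same partition regardless of the time zone of the machine running the job.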
Go to docker directory and run build script.
Run docker-compose.
docker-compose up -d --no-recreate
Access jupyterlab shell. Can also attach VSCode to the jupyterlab container.
docker exec -it jupyterlab bash
Activate the virtual environment.
source redditStreaming/reddit-env/bin/activate
Go to reddit directory.
cd redditStreaming/src/reddit
Start pyspark streaming application.
python3 -m
- java gateway exited before sending port number: make sure JAVA_HOME is set and java -version reports 1.8
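The Java check for this error can be scripted before starting the streaming application. This is a minimal sketch: `is_java_8` and `check_java` are hypothetical helper names, and it only assumes that `java -version` prints a line containing `version "1.8..."`, which it writes to stderr.

```python
import os
import re
import subprocess


def is_java_8(version_output: str) -> bool:
    """Return True if `java -version` output reports a 1.8.x runtime."""
    match = re.search(r'version "([^"]+)"', version_output)
    return bool(match) and match.group(1).startswith("1.8")


def check_java() -> bool:
    """Check that JAVA_HOME is set and the java on PATH is 1.8 (hypothetical helper)."""
    if not os.environ.get("JAVA_HOME"):
        print("JAVA_HOME is not set")
        return False
    try:
        # `java -version` writes its report to stderr, not stdout.
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except FileNotFoundError:
        print("java executable not found on PATH")
        return False
    return is_java_8(result.stderr)


if __name__ == "__main__":
    print("java ok" if check_java() else "java check failed")
```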
Start the kafka producer.
python3 -m
- stored cluster id xyz does not match: go to /cluster_config/kafka/logs/ and change the stored cluster id to the correct one
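Kafka records its cluster id in a meta.properties file inside the broker's log directory, as a `cluster.id=...` line. The sketch below extracts that value so it can be compared with the id in the error message. `parse_cluster_id` is a hypothetical helper, and the assumption that the file sits directly under /cluster_config/kafka/logs/ has not been verified against this project.

```python
from typing import Optional


def parse_cluster_id(meta_properties: str) -> Optional[str]:
    """Return the cluster.id value from the text of a Kafka meta.properties file."""
    for line in meta_properties.splitlines():
        line = line.strip()
        if line.startswith("cluster.id="):
            return line.split("=", 1)[1]
    return None


if __name__ == "__main__":
    # Assumed location; adjust to wherever the broker's log directory is mounted.
    path = "/cluster_config/kafka/logs/meta.properties"
    try:
        with open(path) as handle:
            print(parse_cluster_id(handle.read()))
    except FileNotFoundError:
        print(f"{path} not found")
```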
Remove untagged docker images.
docker rmi $(docker images | grep "^<none>" | awk '{print $3}')
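The same filter can be written in Python when the list of image ids needs inspecting before anything is deleted. This is a sketch only: `untagged_image_ids` is a hypothetical helper that mirrors the grep/awk pipeline by taking the third whitespace-separated column of every `docker images` row whose repository is `<none>`.

```python
def untagged_image_ids(docker_images_output: str) -> list:
    """Return image ids of rows in `docker images` output whose repository is <none>."""
    ids = []
    for line in docker_images_output.splitlines():
        if line.startswith("<none>"):
            fields = line.split()
            # Columns are REPOSITORY, TAG, IMAGE ID, ...; the id is the third field.
            if len(fields) >= 3:
                ids.append(fields[2])
    return ids


if __name__ == "__main__":
    import subprocess

    output = subprocess.run(["docker", "images"], capture_output=True, text=True).stdout
    print(untagged_image_ids(output))
```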
Prune docker system volumes, containers & images.
docker system prune && docker volume prune && docker container prune && docker image prune
When changing version of spark, hadoop, jupyterlab, etc, versions must be updated in
, respective *.Dockerfile
, requirements.txt
Likely caused by guava jar mismatch, follow steps here:
If there are kafka errors, run docker-compose down, delete the cluster_config/kafka/logs and cluster_config/zookeeper/data/version-2 directories, then run docker-compose up -d.
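The directory cleanup in the step above can be sketched as a small helper to run between docker-compose down and docker-compose up -d. `reset_kafka_state` and `STALE_DIRS` are hypothetical names, and the code assumes it is given the path to the cluster_config directory.

```python
import shutil
from pathlib import Path

# Directories named in the step above, relative to cluster_config.
STALE_DIRS = ("kafka/logs", "zookeeper/data/version-2")


def reset_kafka_state(cluster_config: str) -> list:
    """Delete stale Kafka and ZooKeeper state; return the paths that were removed."""
    removed = []
    for rel in STALE_DIRS:
        target = Path(cluster_config) / rel
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(str(target))
    return removed


if __name__ == "__main__":
    # Run only while the containers are stopped.
    print(reset_kafka_state("cluster_config"))
```

Returning the removed paths makes it easy to confirm that both directories were actually found, since a wrong working directory would otherwise fail silently.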
s3 artifact directory.
configure aws cli first.
aws configure
aws s3 sync redditStreaming/src/main/python/scripts/ s3://reddit-streaming-stevenhurwitt/scripts/
aws s3 sync s3://reddit-streaming-stevenhurwitt/scripts/ redditStreaming/src/main/python/scripts/.
Build wheel file.
cd redditStreaming/src/main/python/reddit && python3 setup.py bdist_wheel
Docker Compose.
docker-compose down && ./ && docker-compose up -d
docker exec -it jupyterlab bash
cd terraform && tf plan -out tfplan && tf apply tfplan
- lambda function to backup s3 to local daily (aws s3 sync...)
- glue function for s3 to docker postgres (aws is $6 a day??? could use a smaller instance?)
- airflow to gracefully restart streaming and producer jobs as needed
- could move from docker-compose local streaming app to cloud based
- kubernetes cluster w/ raspberry pis and local pc