A proof-of-concept (PoC) project for a serverless Business Intelligence (BI) pipeline for non-relational data, using the AWS tech stack:
- AWS S3 - data lake
- AWS Glue - data crawler (schema discovery tool)
- AWS Athena - SQL query engine
- AWS QuickSight - visualization tool
A few scripts in the `data_preprocessing_scripts` directory for cleaning and preprocessing the Yelp dataset.
The code assumes the following directory structure:
- `/data_preprocessing_scripts/data/raw_data` - contains the raw `business.json`, `review.json` and `user.json` from the Yelp dataset
- `/data_preprocessing_scripts/data` - contains (empty) `csv_data`, `json_data`, `json_data_schemas` and `parquet_data` directories
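For convenience, a minimal sketch (a hypothetical helper, not part of the repository's scripts) that creates the expected layout, assuming the repository root as the working directory:

```python
from pathlib import Path

# Expected layout under data_preprocessing_scripts/data
# (hypothetical helper, not part of the repository's scripts).
BASE = Path("data_preprocessing_scripts/data")

def create_expected_dirs(base: Path = BASE) -> None:
    """Create raw_data plus the empty output directories the scripts expect."""
    for name in ["raw_data", "csv_data", "json_data", "json_data_schemas", "parquet_data"]:
        (base / name).mkdir(parents=True, exist_ok=True)

if __name__ == "__main__":
    create_expected_dirs()
```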
This script performs the initial cleaning of the Yelp dataset before further analysis:

`yelp_academic_dataset_business.json` (see the sketch after this list):
- reformatting due to inconsistencies of data, e.g. change `u'"value"'` to `"value"`
- change `categories` from comma-separated string to list
- change `hours` to `days_open`: drop hours information, use only days of week
- drop `address`, `postal_code` and `is_open`
- heavy changes to `attributes` - selection of only a few, make data types more consistent

`yelp_academic_dataset_checkin.json`:
- dropped, will not be used

`yelp_academic_dataset_review.json`:
- drop review text
- drop `review_id`
- drop `date` time information, use hour only

`yelp_academic_dataset_tip.json`:
- dropped, will not be used

`yelp_academic_dataset_user.json`:
- drop `friends` and `name`
- drop `yelping_since` time information, use hour only
- change `elite` from comma-separated string to list
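As a rough illustration of the `business.json` transformations listed above, here is a minimal sketch. It is not the actual `clean_data.py`: the value normalization is simplified and the set of kept attributes is an assumption.

```python
import json

def clean_business_record(raw: dict) -> dict:
    """Simplified sketch of the business.json cleaning steps."""
    def normalize(value):
        # Roughly undo inconsistent encodings such as u'"value"' -> value
        if isinstance(value, str):
            return value.removeprefix("u'").removesuffix("'").strip('"')
        return value

    # categories: comma-separated string -> list
    categories = [c.strip() for c in (raw.get("categories") or "").split(",") if c.strip()]

    # hours -> days_open: keep only the days of week, drop the hour ranges
    days_open = sorted((raw.get("hours") or {}).keys())

    # attributes: keep only a few, with normalized values (the selection here is an assumption)
    kept_attributes = {k: normalize(v) for k, v in (raw.get("attributes") or {}).items()
                       if k in {"RestaurantsPriceRange2", "WiFi", "Alcohol"}}

    # address, postal_code and is_open are dropped by simply not copying them
    return {
        "business_id": raw["business_id"],
        "name": raw["name"],
        "city": raw.get("city"),
        "state": raw.get("state"),
        "stars": raw.get("stars"),
        "review_count": raw.get("review_count"),
        "categories": categories,
        "days_open": days_open,
        "attributes": kept_attributes,
    }
```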
Results are saved in `/data/json_data`. They are in the default Athena format,
i.e. a JSON stream (newline-delimited JSON, one object per line).
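Writing that format is just one `json.dumps` call per record; a short sketch (the function name is hypothetical):

```python
import json

def write_json_lines(records, path):
    """Write records as newline-delimited JSON (one object per line)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```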
This script crawls the cleaned JSONs from `/data/json_data` and saves their schemas
in `/data/json_data_schemas`. This is done to:
- explore the schemas
- debug the data cleaning script, avoiding paying for the AWS Glue Crawler just to discover a bug
- validate data types discovered by Glue
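A minimal sketch of what such local schema discovery could look like (the real script may differ; the type inference here is deliberately naive and only looks at the first record of each file):

```python
import json
from pathlib import Path

def infer_type(value):
    """Map a JSON value to a rough type name, recursing into objects and arrays."""
    if isinstance(value, dict):
        return {k: infer_type(v) for k, v in value.items()}
    if isinstance(value, list):
        return [infer_type(value[0])] if value else ["unknown"]
    return type(value).__name__

def crawl_schemas(src_dir="data/json_data", dst_dir="data/json_data_schemas"):
    """Infer a schema from the first record of each JSON-lines file and save it."""
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for src in Path(src_dir).glob("*.json"):
        with open(src, encoding="utf-8") as f:
            first_record = json.loads(f.readline())
        schema = infer_type(first_record)
        (Path(dst_dir) / f"{src.stem}_schema.json").write_text(json.dumps(schema, indent=2))
```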
This script converts the collections of JSONs created with `clean_data.py` to
CSV or Apache Parquet. Results are saved in `/data/csv_data` or `/data/parquet_data`,
depending on the target format.
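A hedged sketch of such a conversion step using pandas, which reads JSON lines directly and writes both targets (pyarrow or fastparquet is assumed for Parquet output; paths and the function name are illustrative):

```python
from pathlib import Path
import pandas as pd

def convert(src="data/json_data/business.json", target="parquet"):
    """Convert a JSON-lines file to CSV or Parquet."""
    df = pd.read_json(src, lines=True)  # one JSON object per line
    name = Path(src).stem
    if target == "csv":
        out = Path("data/csv_data") / f"{name}.csv"
        df.to_csv(out, index=False)
    else:
        out = Path("data/parquet_data") / f"{name}.parquet"
        df.to_parquet(out, index=False)  # requires pyarrow or fastparquet
    return out
```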
SQL queries used in AWS Athena for benchmarking the approaches are in the `SQL_queries`
directory.
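The queries can be run from the Athena console or programmatically, e.g. with boto3; a minimal sketch (the database name, table name, query and S3 output location are placeholders, not taken from the repository):

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Placeholders: replace with the actual Glue database, table and S3 output location.
response = athena.start_query_execution(
    QueryString="SELECT stars, COUNT(*) AS n FROM business GROUP BY stars ORDER BY stars;",
    QueryExecutionContext={"Database": "yelp_poc"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},
)
print(response["QueryExecutionId"])
```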