Dataset: Citi Bike trip data for New York City, year 2020 (Citi Bike Trip Dataset)
Dataset format
- Trip Duration (seconds)
- Start Time and Date
- Stop Time and Date
- Start Station Name
- End Station Name
- Station ID
- Station Lat/Long
- Bike ID
- User Type (Customer = 24-hour pass or 3-day pass user; Subscriber = Annual Member)
- Gender (0=unknown; 1=male; 2=female)
- Year of Birth
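The field list above can be turned into a typed record when reading the CSV. The following is a minimal sketch, not the project's actual code: the column names (`tripduration`, `starttime`, `birth year`, and so on), the timestamp format, and the station names in the sample row are assumptions for illustration. Only the gender and user type code tables come from the list above.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime

# Code tables taken from the field list above.
GENDER_LABELS = {0: "unknown", 1: "male", 2: "female"}
USER_TYPE_LABELS = {
    "Customer": "24-hour or 3-day pass",
    "Subscriber": "annual member",
}

# Assumed timestamp layout; %f accepts one to six fractional digits.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


@dataclass
class Trip:
    duration_seconds: int
    start_time: datetime
    stop_time: datetime
    start_station: str
    end_station: str
    bike_id: int
    user_type: str
    gender: str
    birth_year: int


def parse_trip(row: dict) -> Trip:
    """Convert one CSV row (a dict of strings) into a typed Trip."""
    return Trip(
        duration_seconds=int(row["tripduration"]),
        start_time=datetime.strptime(row["starttime"], TIME_FORMAT),
        stop_time=datetime.strptime(row["stoptime"], TIME_FORMAT),
        start_station=row["start station name"],
        end_station=row["end station name"],
        bike_id=int(row["bikeid"]),
        # Fall back to the raw value if an unexpected user type appears.
        user_type=USER_TYPE_LABELS.get(row["usertype"], row["usertype"]),
        gender=GENDER_LABELS[int(row["gender"])],
        birth_year=int(row["birth year"]),
    )


# Hypothetical sample row, made up for illustration.
SAMPLE_CSV = (
    "tripduration,starttime,stoptime,start station name,end station name,"
    "bikeid,usertype,birth year,gender\n"
    "789,2020-01-01 00:00:55.3900,2020-01-01 00:14:05.1470,"
    "Station A,Station B,30326,Subscriber,1992,1\n"
)

trip = parse_trip(next(csv.DictReader(io.StringIO(SAMPLE_CSV))))
print(trip.gender, trip.user_type)
```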
For this project, I chose the Citi Bike trip dataset for New York City. The dataset is publicly available and is updated monthly. The goal was to develop a dashboard containing a trip summary and the user distribution by user type and gender.
The data pipeline runs as a periodic (monthly) batch job.
- Data pipeline with 5 steps : Download_dataset_task (download the zip archive) -> Unzip_data_task (zip to CSV) -> Remove_zip_task (delete the zip file) -> Format_to_parquet_task (convert CSV to Parquet) -> local_to_gcs_task (upload the Parquet file to the data lake, GCS)
- Cloud : GCP
- IaC : Terraform to create the GCS bucket and configure BigQuery
- Workflow orchestration : Airflow running in a container (Docker)
- Data Warehouse : BigQuery
- Transformation : simple SQL transformations
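The pipeline steps listed above can be sketched as plain Python functions. This is only an illustration under stated assumptions: the function names, paths, and arguments are hypothetical, and the project's real tasks may be implemented differently (for example as Bash or Python operators inside the Airflow DAG). The Parquet step assumes `pyarrow` is installed and the upload step assumes `google-cloud-storage` with credentials configured, so both are imported lazily.

```python
import zipfile
from pathlib import Path
from urllib.request import urlretrieve


def download_dataset(url: str, zip_path: Path) -> Path:
    """Download the monthly zip archive to zip_path."""
    urlretrieve(url, str(zip_path))
    return zip_path


def unzip_data(zip_path: Path, out_dir: Path) -> Path:
    """Extract the first CSV in the archive and return its path."""
    with zipfile.ZipFile(zip_path) as archive:
        csv_names = [
            name for name in archive.namelist()
            if name.endswith(".csv") and not name.startswith("__MACOSX")
        ]
        if not csv_names:
            raise ValueError(f"no CSV file found in {zip_path}")
        extracted = archive.extract(csv_names[0], path=out_dir)
    return Path(extracted)


def remove_zip(zip_path: Path) -> None:
    """Delete the zip archive once the CSV has been extracted."""
    zip_path.unlink()


def format_to_parquet(csv_path: Path) -> Path:
    """Convert the CSV to Parquet next to the source file (needs pyarrow)."""
    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    table = pv.read_csv(str(csv_path))
    parquet_path = csv_path.with_suffix(".parquet")
    pq.write_table(table, str(parquet_path))
    return parquet_path


def local_to_gcs(bucket_name: str, object_name: str, local_file: Path) -> None:
    """Upload a local file to a GCS bucket (needs google-cloud-storage)."""
    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_filename(str(local_file))
```

In an orchestrator such as Airflow, each function would typically become one task, chained in the same order as the arrows in the step list.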
Dashboard with 3 tiles (1 bar chart and 2 pie charts) built from the 2020 bike trip dataset
- Most popular routes by gender
- Distribution by user type
- Monthly trip summary
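The three tiles correspond to three simple aggregations. The project computes them with SQL in the data warehouse; the sketch below expresses the same idea in plain Python for illustration only. The column names and the sample rows are assumptions, not taken from the real data.

```python
from collections import Counter


def monthly_trip_counts(rows):
    """Count trips per month, keyed by the 'YYYY-MM' prefix of starttime."""
    return Counter(row["starttime"][:7] for row in rows)


def user_type_distribution(rows):
    """Count trips per user type (Customer or Subscriber)."""
    return Counter(row["usertype"] for row in rows)


def top_routes_by_gender(rows, n=1):
    """Return the n most frequent (start, end) station pairs per gender code."""
    routes = {}
    for row in rows:
        route = (row["start station name"], row["end station name"])
        routes.setdefault(row["gender"], Counter())[route] += 1
    return {gender: counter.most_common(n) for gender, counter in routes.items()}


# Hypothetical rows, made up for illustration.
sample_rows = [
    {"starttime": "2020-01-01 08:00:00", "usertype": "Subscriber", "gender": "1",
     "start station name": "Station A", "end station name": "Station B"},
    {"starttime": "2020-01-15 09:30:00", "usertype": "Customer", "gender": "2",
     "start station name": "Station A", "end station name": "Station C"},
    {"starttime": "2020-02-03 18:45:00", "usertype": "Subscriber", "gender": "1",
     "start station name": "Station A", "end station name": "Station B"},
]

print(monthly_trip_counts(sample_rows))
print(user_type_distribution(sample_rows))
print(top_routes_by_gender(sample_rows))
```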