MAST30034 Applied Data Science - Group 5
Group members:
- John Melluish (1171367)
- Rebekah Wu (1074412)
- Seen Yan Lee (1183422)
- Sen Turner (1168692)
- Yi Xiang Chee (1165917)
-
download.py
: Downloads external datasets- requires argument --output: the path from current directory to the desired output file (the external data folder)
-
ETL_basic.py
: Performs ETL on transaction and merchant data (Output:joined_data.parquet
,consumer_fraud.parquet
,merchant_fraud.parquet
)- requires argument --curatedpath: that path from current directory to the curated folder & --tablespath: the path from current directory to the tables folder
-
external_connection.py
: Joins data with external datasets (Input:joined_data.parquet
, Output:external_data.parquet
,external_joined_data.parquet
)- requires argument --curatedpath: that path from current directory to the curated folder & --externalpath: the path from current directory to the external folder
-
null_imputation.py
: Performs imputation on null values (Input:external_joined_data.parquet
, Output:final_data.parquet
)- requires argument --curatedpath: the path from current directory to the curated folder
-
analysis.ipynb
: Performs outlier removal and analysis (Input:final_data.parquet
,external_data.parquet
, Output:cleaned_data.parquet
) -
merchant_forecast.ipynb
: Produces forecasts on merchants' performance for the next 365 days (Input:cleaned_data.parquet
, Output:future_predictions.csv
) -
fraud_regression.ipynb
: Produces predictions for consumer and merchant fraud given the known fraud cases, highlighting suspicious fraud cases (Input:merchant_fraud.parquet
,consumer_fraud.parquet
, Output:merchant_fraud_rate.csv
,consumer_fraud_rate.csv
) -
merchant_fraud.ipynb
: Generates insights on merchant based on their fraud rate (Input:merchant_fraud.parquet
) -
ranking_features.ipynb
: Produces feature table to be used for ranking merchants, (Input:cleaned_data.parquet
,merchant_fraud_rate.csv
,consumer_fraud_rate.csv
,future_predictions.csv
, Output:merchant_ranking_properties.csv
) -
one_hot.ipynb
: Performs clustering on merchants based on given tags to produce merchant segments (Input:tbl_merchant.parquet
, Output:segmented_merchants.csv
) -
merchant_ranking.ipynb
: Produces ranking model and uses it to rank merchants (Input:merchant_ranking_properties.csv
,segmented_merchants.csv
) Note: this notebook is also to be used for Checkpoint 6 -
final_summary.ipynb
: Summarises the project's findings