/bigdata_proj

The project of big data class.

1 Introduction

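This repository is the course project for a big data class: predicting customer credit levels with deep learning and federated learning.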

1.1 Tasks

  1. Design a deep learning model to predict the credit level of each customer.
  2. Partition the whole dataset across 5 parties so that the data is independent and identically distributed (IID), then adopt a federated learning framework for model training and aggregation. Show the advantage of collaborative training over individual local training.

1.2 Dataset

BankChurners.csv contains basic information on 9000 bank customers. The target variable is the credit level, ranging from 1 (bad) to 10 (excellent):

  • CustomerId – unique ID used to identify each bank customer.
  • Geography – the customer's country.
  • Tenure – number of years for which the customer has been with the bank.
  • Balance – bank balance of the customer.
  • NumOfProducts – number of bank products the customer is utilizing.
  • HasCrCard – binary flag for whether the customer holds a credit card or not.
  • IsActiveMember – binary flag for whether the customer is an active member or not.
  • EstimatedSalary – estimated salary of the customer in dollars.
  • Exited – binary flag: 1 if the customer closed an account with the bank, 0 if the customer is retained.
  • CreditLevel – credit level of the customer (the target variable).

New_BankChurners.csv contains the same basic information for 1000 new bank customers whose credit level is unknown.

1.3 File Structure

  • data_processing.ipynb: data preprocessing
  • deep_learning.py: training with a single DNN model; accuracy about 20%
  • deep_learning_subclass.py: one subclass model plus three classifiers; accuracy about 21.6%
  • ml_with_nn.py: deep learning combined with machine learning; accuracy 37.22%
  • federated_learning.py: federated learning module, including IID and non-IID partitioning, FedAvg, and so on
  • federated_learning_noiid: improvements to FL on non-IID data

2 Data Processing/Analytics

Data processing and analytics consist of three parts: exploratory data analysis, data preprocessing, and feature engineering. See data_processing.ipynb for details.
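As a rough illustration of the preprocessing step, the sketch below one-hot encodes Geography, standardizes a few numeric columns, and shifts CreditLevel from 1..10 to class indices 0..9. It is not taken from data_processing.ipynb: the helper names `one_hot` and `standardize`, the choice of columns, and the toy rows are all assumptions made for the example.

```python
import numpy as np

def one_hot(values):
    """One-hot encode categorical values (hypothetical helper, sorted category order)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=np.float32)
    encoded[np.arange(len(values)), [index[v] for v in values]] = 1.0
    return encoded, categories

def standardize(x):
    """Scale each column to zero mean and unit variance (hypothetical helper)."""
    x = np.asarray(x, dtype=np.float64)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (x - x.mean(axis=0)) / std

# Toy rows standing in for BankChurners.csv; the values are made up.
geography = ["France", "Spain", "France", "Germany"]
numeric = [  # Tenure, Balance, NumOfProducts
    [2, 0.0, 1],
    [8, 83807.86, 1],
    [1, 159660.8, 3],
    [5, 0.0, 2],
]
credit_level = np.array([3, 7, 10, 1])

geo_onehot, geo_categories = one_hot(geography)
features = np.hstack([standardize(numeric), geo_onehot])
labels = credit_level - 1  # shift 1..10 to class indices 0..9
```

Standardizing matters here because Balance and EstimatedSalary are several orders of magnitude larger than the binary flags, which can otherwise dominate gradient updates in a DNN.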

3 Model design and implementation

3.1 Training with one model

See deep_learning_full.py for the implementation.

3.2 One subclass model + 3 classifiers

See deep_learning_subclass.py for the implementation.

3.3 Deep learning + machine learning

See ml_with_nn.py for the implementation.

4 Framework of federated learning

Federated learning process:

  • divide the dataset into IID or non-IID partitions
  • for each round:
    • train the model on each client and save its weights
    • update the global model with the averaged client weights (avg_weig)
  • predict
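The round structure above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in federated_learning.py: a linear least-squares model stands in for the DNN, and the names `local_train`, `fed_avg`, and `run_federated`, along with the learning rate and round counts, are assumptions chosen for the example. The aggregation step weights each client by its local sample count, as FedAvg does.

```python
import numpy as np

def local_train(weights, x, y, lr=0.1, epochs=5):
    """A few epochs of full-batch gradient descent on a linear least-squares model."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2.0 * x.T @ (x @ w - y) / len(x)
        w -= lr * grad
    return w

def fed_avg(client_weights, client_sizes):
    """FedAvg aggregation: average client weights, weighted by local sample count."""
    sizes = np.asarray(client_sizes, dtype=np.float64)
    stacked = np.stack(client_weights)
    return (stacked * (sizes / sizes.sum())[:, None]).sum(axis=0)

def run_federated(clients, dim, rounds=20):
    """Repeat: broadcast the global weights, train locally, aggregate."""
    global_w = np.zeros(dim)
    for _ in range(rounds):
        local_ws = [local_train(global_w, x, y) for x, y in clients]
        global_w = fed_avg(local_ws, [len(x) for x, _ in clients])
    return global_w

# Synthetic data for 5 clients sharing one underlying linear relationship.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(5):
    x = rng.normal(size=(200, 3))
    clients.append((x, x @ true_w))

global_w = run_federated(clients, dim=3)
```

With a real DNN the same loop applies per layer: each layer's weight array is averaged across clients with the same sample-count weights.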

4.1 Data partition

  1. IID: use the bank_iid function, or set α to 1000 and use dirichlet_partition to build the partitions.

  2. Non-IID: set α to 0.1 and use dirichlet_partition to build the partitions.

    Then run federated_learning.py.

  • IID data: dataset divided randomly [figure: IID_simple], and with α set to 1000 [figure]

  • non-IID data, with α set to 0.1 [figure]
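To make the role of α concrete, here is a hedged sketch of both partitioning schemes. It does not reproduce the repository's bank_iid or dirichlet_partition functions: `iid_split` and `dirichlet_split` are hypothetical names, and the label-wise Dirichlet scheme shown is one common way to build such partitions. For each class, client proportions are drawn from Dirichlet(α); a large α gives nearly equal shares (close to IID), while a small α concentrates each class on a few clients.

```python
import numpy as np

def iid_split(n_samples, n_clients, rng):
    """Shuffle all sample indices and split them into near-equal parts."""
    return np.array_split(rng.permutation(n_samples), n_clients)

def dirichlet_split(labels, n_clients, alpha, rng):
    """Split each class across clients using Dirichlet(alpha) proportions."""
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        # Turn cumulative proportions into split positions within this class.
        cut_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cut_points)):
            client_indices[client].extend(part.tolist())
    return [np.array(ix, dtype=int) for ix in client_indices]

# Synthetic labels standing in for CreditLevel (10 classes, 9000 customers).
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=9000)

iid_parts = iid_split(len(labels), 5, rng)
near_iid_parts = dirichlet_split(labels, 5, alpha=1000, rng=rng)
non_iid_parts = dirichlet_split(labels, 5, alpha=0.1, rng=rng)

# Per-client label histograms, useful for plotting the partition.
non_iid_hist = [np.bincount(labels[p], minlength=10) for p in non_iid_parts]
```

Comparing `non_iid_hist` against the same histogram for `near_iid_parts` shows how strongly each client's label distribution departs from the global one.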

4.2 Federated Learning Result

  • IID result (federated_learning.py) [figure]

  • non-IID result (federated_learning.py) [figure: fl_non]

  • improved non-IID result (federated_learning_iid.py) [figure: non_iid_improve]