PwC data science hackathon.
Given a large dataset of customers' financial transactions, how do we predict the expected loan repayment date?
- Receive Data
- Convert data into computable form
- Feature filter
- Classification/regression algorithms
- Analysis of results
- Sklearn - logistic regression model, alongside XGBoost (a separate library) for the xgboost model
- NumPy - for array-based data processing and preparing data for visualisation
- Pandas - for data processing
- Keras - neural network model
CSV file containing 20k+ rows.
The data contains 20 features:
['InvoiceAmount', 'DocumentNo', 'PaymentDocumentNo', 'InvoiceItemDesc', 'ReferenceDocumentNo', 'duration', 'InvoiceDesc', \
'Vendor 00332', 'UserName', 'TransactionCode', 'Vendor 01089', 'Vendor 00415', 'Period', 'Vendor 01024', 'Vendor 00070', \
'CompanyCode', 'TransactionCodeDesc', 'Vendor 01689', 'PO_FLag', 'Vendor 01532']
Fill missing values with the median
Convert string data to one-hot encoding
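The two preprocessing steps above can be sketched without any dependencies, just to show the arithmetic. The helper names `impute_median` and `one_hot` and the toy values are hypothetical; with Pandas, `DataFrame.fillna` and `pandas.get_dummies` would be the usual route, though the project's actual code is not shown here.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def one_hot(values):
    """Map each string to a 0/1 vector, one column per distinct category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


# Column names come from the feature list; the values are made up.
invoice_amount = [120.0, None, 80.0, 200.0]
company_code = ["C1", "C2", "C1", "C3"]

filled = impute_median(invoice_amount)
categories, encoded = one_hot(company_code)
```

The median is computed from the observed values only, so a missing entry never influences its own fill value.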
By checking feature collinearity through VIF (variance inflation factor)
By checking the correlation heat map
Feature extraction using PCA
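A minimal NumPy sketch of the three feature-filtering checks above, on synthetic data. The functions `vif` and `pca` are illustrative helpers written for this example, not the project's code; statsmodels and scikit-learn provide ready-made equivalents. The third toy column is deliberately built as a near copy of the first so that the collinearity shows up.

```python
import numpy as np


def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2), where R^2 comes
    from regressing that column on all other columns plus an intercept."""
    n, p = X.shape
    scores = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        residual = y - others @ coef
        r2 = 1.0 - residual.var() / y.var()
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)


def pca(X, n_components):
    """Project centred data onto its top principal directions via SVD."""
    centered = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained_ratio = s**2 / np.sum(s**2)
    return centered @ vt[:n_components].T, explained_ratio[:n_components]


# Synthetic data (an assumption for illustration): column 2 nearly copies column 0.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)
X = np.column_stack([a, b, c])

scores = vif(X)                      # large for columns 0 and 2, near 1 for column 1
corr = np.corrcoef(X, rowvar=False)  # the matrix a correlation heat map would plot
projected, ratio = pca(X, n_components=2)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign that a feature is largely explained by the others and is a candidate for removal.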
Logistic regression using
Support Vector Machine
Xgboost
Neural Network
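Of the models listed above, logistic regression is the simplest to sketch from scratch. The example below is a plain gradient-descent version in NumPy, meant only to show what the model optimises; it is not the Sklearn estimator, and the function names, learning rate, epoch count and toy data are assumptions.

```python
import numpy as np


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad = p - y                            # gradient of log loss w.r.t. the logit
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def predict(X, w, b):
    """Predict class 1 when the logit is positive (probability above 0.5)."""
    return (X @ w + b > 0).astype(int)


# Centred toy feature: negative values belong to class 0, positive to class 1.
X = np.array([[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

w, b = fit_logistic(X, y)
predictions = predict(X, w, b)
```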
Check the confusion matrix
Check the R-squared score
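Both checks above reduce to short formulas, sketched here in plain Python. The helpers `confusion_counts` and `r_squared` and the toy values are hypothetical; `sklearn.metrics` offers library versions of both metrics.

```python
def confusion_counts(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up classification labels and regression targets.
matrix = confusion_counts([0, 0, 1, 1, 1], [0, 1, 1, 1, 0], labels=[0, 1])
score = r_squared([10, 20, 30, 40], [12, 18, 33, 39])
```

The confusion matrix applies to the classification models, while R-squared only makes sense when the target (for example `duration`) is predicted as a continuous value.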
Is it a problem of overfitting?
If so, possible approaches:
1. Oversampling.
2. Early stopping by evaluating log loss.
3. Feature reduction
4. Regularization
5. Stratify data (split data according to the label proportions)
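Two of the approaches above, stratified splitting and early stopping on log loss, can be sketched in plain Python. The helper names, the `patience` parameter and the loss values are assumptions for illustration; with Sklearn, `train_test_split(..., stratify=y)` would handle the split.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) keeping each label's share in both parts."""
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        cut = round(len(indices) * test_fraction)
        test_idx.extend(indices[:cut])
        train_idx.extend(indices[cut:])
    return train_idx, test_idx


def early_stopping_epoch(val_losses, patience=2):
    """Return the best epoch, stopping once the validation log loss has not
    improved for `patience` consecutive epochs."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


# Imbalanced toy labels: 80 of class 0 and 20 of class 1.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)

# Made-up validation log losses, one per epoch.
best = early_stopping_epoch([0.69, 0.55, 0.48, 0.50, 0.53, 0.47])
```

Note that training halts before the final loss value is ever evaluated, which is the trade-off early stopping makes: a later improvement can be missed if `patience` is too small.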
What if it is not a problem of overfitting?
1. Preprocessing workflow
2. Retrieve data
3. temp_data_process (imputation with median or zeros)
4. Oversampling with SMOTE (generates synthetic samples so the minority class is better represented)
5. Scale data to a small numeric range
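The last two steps of this workflow can be sketched as follows. `smote_like` is a simplified stand-in for SMOTE (it interpolates between a minority point and one of its nearest minority neighbours), and min-max scaling to [0, 1] is only one possible reading of "small range"; both helpers and the toy points are assumptions. The imbalanced-learn library provides a full SMOTE implementation.

```python
import random


def min_max_scale(column):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority points, sorted by squared distance.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j])),
        )[:k]
        other = minority[rng.choice(neighbours)]
        t = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, other)))
    return synthetic


# Made-up minority-class points in two features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3)]
synthetic = smote_like(minority, n_new=3)

scaled = min_max_scale([10.0, 20.0, 40.0])
```

Oversampling should be applied to the training split only, so that synthetic points never leak into the data used for evaluation.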