Reference

Techniques

The important things are a good set of smart features and a diverse set of base algorithms.
Many features are based on division and subtraction of columns from application_train.csv.
- The most notable division was by EXT_SOURCE_3.
The most important features that I engineered, in descending order of importance (measured by gain in the LGBM model)
Find the data structure, understand the column descriptions, and manage the features.
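The division and subtraction features mentioned above can be sketched as follows. This is a minimal illustration in plain Python, not the code used in this repository: the helper names and the derived feature names are hypothetical, and the choice of AMT_CREDIT and AMT_GOODS_PRICE as the columns to combine with EXT_SOURCE_3 is an assumption.

```python
import math


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the ratio is undefined."""
    if numerator is None or denominator is None or denominator == 0:
        return math.nan
    return numerator / denominator


def safe_diff(left, right):
    """Return left - right, or NaN when either value is missing."""
    if left is None or right is None:
        return math.nan
    return left - right


def add_ratio_features(row):
    """Return a copy of one application row with illustrative derived features."""
    out = dict(row)
    # Hypothetical feature names; the column pairing is an assumption.
    out["CREDIT_DIV_EXT_SOURCE_3"] = safe_ratio(
        row.get("AMT_CREDIT"), row.get("EXT_SOURCE_3")
    )
    out["CREDIT_MINUS_GOODS_PRICE"] = safe_diff(
        row.get("AMT_CREDIT"), row.get("AMT_GOODS_PRICE")
    )
    return out


# Toy rows standing in for application_train.csv (values are made up).
rows = [
    {"AMT_CREDIT": 400000.0, "AMT_GOODS_PRICE": 350000.0, "EXT_SOURCE_3": 0.5},
    {"AMT_CREDIT": 200000.0, "AMT_GOODS_PRICE": 200000.0, "EXT_SOURCE_3": None},
]
features = [add_ratio_features(r) for r in rows]
```

Returning NaN for a missing or zero denominator keeps the row in the data set and leaves the handling of missing values to the model or a later imputation step.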

Outlook

Use feather files
Do feature engineering in script files
How to do feature selection -> use LGBM importance?
Use the predicted values of such regressions as features
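One possible way to approach the feature selection question is to rank features by gain and keep the top ones until they cover most of the total gain. The sketch below assumes the gains are already available as a dictionary (with LightGBM's Python API they could be collected from booster.feature_name() and booster.feature_importance(importance_type="gain")). The function name, the keep_share threshold, and the gain values are all hypothetical.

```python
def select_by_gain(gain_by_feature, keep_share=0.95):
    """Keep the highest-gain features until they cover keep_share of total gain."""
    total = sum(gain_by_feature.values())
    if total <= 0:
        return []
    ranked = sorted(gain_by_feature.items(), key=lambda kv: kv[1], reverse=True)
    selected, covered = [], 0.0
    for name, gain in ranked:
        # Stop once the already selected features cover the requested share.
        if covered / total >= keep_share:
            break
        selected.append(name)
        covered += gain
    return selected


# Made-up gain values, for illustration only.
gains = {
    "EXT_SOURCE_3": 500.0,
    "EXT_SOURCE_2": 300.0,
    "AMT_CREDIT": 150.0,
    "FLAG_DOCUMENT_2": 50.0,
}
selected = select_by_gain(gains, keep_share=0.9)
```

A share-based cut-off adapts to the number of features better than a fixed top-k, but the threshold still needs to be validated with cross validation.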

Flow Chart

Home_Credit_Kaggle

Rmd.file

0_EDA.Rmd: Quick check of the data and search for problems
1_Preprocess_app.Rmd: Preprocessing for application_{train|test}.csv
1_Preprocess_bureau.Rmd: Preprocessing for bureau.csv and bureau_balance.csv (not changed)
1_Preprocess_pre_app.Rmd: Preprocessing for previous_application.csv (not changed)
1_Preprocess_ins_pay.Rmd: Preprocessing for installments_payments.csv (not changed)
1_Preprocess_pos_cash.Rmd: Preprocessing for POS_CASH_balance.csv (not changed)
1_Preprocess_credit.Rmd: Preprocessing for credit_card_balance.csv (not changed)
2_Combine.Rmd: Combining all data and checking the result (not changed)
3_XGBoost.Rmd: Build the XGBoost model, predict, make a submission file, search for the best features, tune parameters (not changed)
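The combining step can be pictured as aggregating each child table per applicant and joining the result onto the application table. The following is a rough sketch in plain Python, not a port of 2_Combine.Rmd: the helper names, the chosen statistics (count and mean), and the output column names are assumptions, while SK_ID_CURR and AMT_CREDIT_SUM are columns of the competition data.

```python
from collections import defaultdict


def aggregate_by_key(rows, key, value):
    """Return {key value: {"count": n, "mean": m}} over child-table rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[key]] += row[value]
        counts[row[key]] += 1
    return {k: {"count": counts[k], "mean": sums[k] / counts[k]} for k in counts}


def left_join(parent_rows, stats, key, prefix):
    """Attach aggregated statistics to every parent row (left join)."""
    joined = []
    for row in parent_rows:
        out = dict(row)
        agg = stats.get(row[key])
        # Applicants without child rows get a count of 0 and a missing mean.
        out[f"{prefix}_COUNT"] = agg["count"] if agg else 0
        out[f"{prefix}_MEAN"] = agg["mean"] if agg else None
        joined.append(out)
    return joined


# Toy rows standing in for application_train.csv and bureau.csv.
applications = [{"SK_ID_CURR": 1}, {"SK_ID_CURR": 2}]
bureau = [
    {"SK_ID_CURR": 1, "AMT_CREDIT_SUM": 1000.0},
    {"SK_ID_CURR": 1, "AMT_CREDIT_SUM": 3000.0},
]
stats = aggregate_by_key(bureau, "SK_ID_CURR", "AMT_CREDIT_SUM")
combined = left_join(applications, stats, "SK_ID_CURR", "BUREAU_AMT_CREDIT_SUM")
```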

jn.file

LightGBM.ipynb: lightgbm, cross validation, predict
NeuralNetwork.ipynb: neural network, predict

py.file

script.file

function.R: Describes the functions in detail
makedummies.R: Converts factor values into dummy variables
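Converting factor values into dummy variables means one 0/1 column per level. The snippet below is a small Python illustration of that idea only; it is not a translation of makedummies.R, and the function name and the column naming scheme (prefix plus level) are assumptions.

```python
def make_dummies(values, prefix):
    """One-hot encode a list of categorical values; None yields all zeros."""
    levels = sorted({v for v in values if v is not None})
    return [
        {f"{prefix}_{level}": int(v == level) for level in levels}
        for v in values
    ]


# NAME_CONTRACT_TYPE is a categorical column of application_train.csv.
contract_types = ["Cash loans", "Revolving loans", None]
dummies = make_dummies(contract_types, "NAME_CONTRACT_TYPE")
```

Sorting the levels keeps the column order stable between the train and test sets, as long as both are encoded with the same set of levels.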

submit.file

file_name + submit_date.csv
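A file name of this form could be generated as in the sketch below. The underscore separator and the YYYYMMDD date format are assumptions, and submission_filename is a hypothetical helper, not a function from this repository.

```python
from datetime import date


def submission_filename(file_name, submit_date=None):
    """Build '<file_name>_<YYYYMMDD>.csv'; defaults to today's date."""
    submit_date = submit_date or date.today()
    return f"{file_name}_{submit_date:%Y%m%d}.csv"


example_name = submission_filename("xgboost", date(2018, 8, 1))
```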

input

csv.file

raw data

csv_imp0.file

{...}.csv: Basic preprocessing applied
all_data_{train|test}.csv: All tables combined

csv_imp1.file

{...}_imp.csv: Missing values imputed, features extracted
all_data_{train|test}.csv: All tables combined
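The imputation method behind the _imp files is not stated here. As one common option, a median fill for a numeric column might look like the sketch below; the function name is hypothetical and the values are made up.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to learn the median from; return the column unchanged.
        return list(values)
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Toy numeric column with one missing entry.
imputed = impute_median([0.2, None, 0.6, 0.4])
```

In practice the fill value should be computed on the training data and reused for the test data, so both sets are imputed consistently.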

data.file

best_para.tsv: Records the best features
score_sheet.tsv: Train AUC, test AUC, LB score
FlowChart.eddx, FlowChart.png: Illustrate the process flow
about_column.numbers: Explains all table columns
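Keeping a score sheet like score_sheet.tsv up to date can be automated by appending one row per experiment. The sketch below uses only the standard library; the column names, the extra model column, and the append_score helper are assumptions, and the numbers are placeholders rather than real results.

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["model", "train_auc", "test_auc", "lb_score"]


def append_score(path, model, train_auc, test_auc, lb_score):
    """Append one experiment row to a TSV file, writing a header if it is new."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        if is_new:
            writer.writerow(FIELDS)
        writer.writerow([model, train_auc, test_auc, lb_score])


# Demonstration in a temporary directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp:
    sheet = Path(tmp) / "score_sheet.tsv"
    append_score(sheet, "example_model", 0.80, 0.78, 0.77)  # placeholder scores
    lines = sheet.read_text().splitlines()
```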

Layered Directory

├── Home_Credit_Kaggle.Rproj
├── README.md
├── Rmd
│   ├── 0_EDA.Rmd
│   ├── 1_Preprocess_app.Rmd
│   ├── 1_Preprocess_app.html
│   ├── 1_Preprocess_bureau.Rmd
│   ├── 1_Preprocess_credit.Rmd
│   ├── 1_Preprocess_ins_pay.Rmd
│   ├── 1_Preprocess_pos_cash.Rmd
│   ├── 1_Preprocess_pre_app.Rmd
│   ├── 2_Combine.Rmd
│   └── 3_XGBoost.Rmd
├── input
│   ├── csv
│   │   ├── HomeCredit_columns_description.csv
│   │   ├── POS_CASH_balance.csv
│   │   ├── application_test.csv
│   │   ├── application_train.csv
│   │   ├── bureau.csv
│   │   ├── bureau_balance.csv
│   │   ├── credit_card_balance.csv
│   │   ├── installments_payments.csv
│   │   ├── previous_application.csv
│   │   └── sample_submission.csv
│   ├── csv_imp0
│   │   ├── all_data_test.csv
│   │   ├── all_data_train.csv
│   │   ├── POS_CASH_balance.csv
│   │   ├── application_test.csv
│   │   ├── application_train.csv
│   │   ├── bureau.csv
│   │   ├── bureau_balance.csv
│   │   ├── credit_card_balance.csv
│   │   ├── installments_payments.csv
│   │   └── previous_application.csv
│   └── csv_imp1
│       ├── all_data_test.csv
│       ├── all_data_train.csv
│       ├── application_test_imp.csv
│       ├── application_train_imp.csv
│       └── credit_card_balance_imp.csv
├── data
│   ├── best_para.tsv
│   ├── best_para_old_100.tsv
│   ├── about_column.numbers
│   ├── FlowChart.eddx
│   ├── FLowChart.png
│   └── score_sheet.tsv
├── jn
│   ├── LightGBM.ipynb 
│   └── NeuralNetwork.ipynb 
├── py
│   ├──  
│   └── 
├── submit
└── script
    ├── function.R
    └── makedummies.R

takuto0831/Home_Credit_Kaggle
