/deu-bil3003-ps2

Decision tree generation with CART algorithm (Introduction to Data Mining / Problem set 2)

Introduction to Data Mining - Problem set 2

The goal is to generate and prune binary-split classification trees with an implementation of the CART algorithm. See the problem set description.
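As a rough illustration of what choosing a binary split involves, here is a minimal sketch. It assumes Gini impurity as the split criterion, which is the usual choice for CART; the criterion actually used in this project may differ. The names `gini` and `best_numeric_split` are hypothetical and are not taken from main.py.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_numeric_split(values, labels):
    """Return (threshold, gain) for the best 'value <= threshold' split.

    Assumption: candidate thresholds are the observed values themselves.
    Returns (None, 0.0) when no split reduces the impurity.
    """
    parent = gini(labels)
    n = len(labels)
    best = (None, 0.0)
    # Skip the largest value: it would leave the right branch empty.
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        gain = parent - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Toy data, not taken from the project's data set.
amounts = [1000, 1500, 3000, 9000]
classes = ["good", "good", "good", "bad"]
threshold, gain = best_numeric_split(amounts, classes)
```

On this toy data the split `credit_amount <= 3000` separates the classes completely, so it has the highest gain. Categorical attributes would need a subset test (`value in {...}`) instead of a threshold, which this sketch does not cover.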

Technologies used in this project:

  • Python 3.8 (Runs with 3.7)
  • GitLab CI
  • Git

Running

To run the project, execute main.py. There are no dependencies:

$ python3.8 main.py

Output

# Decision Tree #
(credit_history in {delayed previously, existing paid, critical/other existing credit})
├(T)─ (credit_amount <= 7882.0)
│     ├(T)─ (credit_history in {delayed previously, existing paid})
│     │     ├(T)─ (property_magnitude in {real estate})
│     │     │     ├(T)─ (credit_amount <= 1768.0)
│     │     │     │     ├(T)─ good
│     │     │     │     └(F)─ (age <= 21.0)
│     │     │     │           ├(T)─ bad
│     │     │     │           └(F)─ good
│     │     │     └(F)─ good
│     │     └(F)─ (age <= 34.0)
│     │           ├(T)─ (employment in {1<=X<4, 4<=X<7})
│     │           │     ├(T)─ good
│     │           │     └(F)─ (credit_amount <= 2578.0)
│     │           │           ├(T)─ (age <= 28.0)
│     │           │           │     ├(T)─ good
│     │           │           │     └(F)─ bad
│     │           │           └(F)─ (employment in {<1, unemployed})
│     │           │                 ├(T)─ (property_magnitude in {real estate, no known property})
│     │           │                 │     ├(T)─ good
│     │           │                 │     └(F)─ bad
│     │           │                 └(F)─ good
│     │           └(F)─ good
│     └(F)─ bad
└(F)─ bad

# Test Result #
Accuracy: 0.72
TP rate: 0.7345132743362832
TN rate: 0.5833333333333334
TP count: 166
TN count: 14
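
For reference, the sketch below shows one common way to compute such figures from actual and predicted labels. It assumes the textbook definitions, TP rate = TP / (TP + FN) and TN rate = TN / (TN + FP), with "good" as the positive class. The project may define these rates differently, so this is not meant to reproduce the numbers above. The function name `evaluate` is hypothetical.

```python
def evaluate(actual, predicted, positive="good"):
    """Compute accuracy, TP rate and TN rate from two equally long label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Guard against empty classes to avoid division by zero.
        "tp_rate": tp / (tp + fn) if tp + fn else 0.0,
        "tn_rate": tn / (tn + fp) if tn + fp else 0.0,
        "tp_count": tp,
        "tn_count": tn,
    }


# Toy labels, not taken from the project's test set.
actual = ["good", "good", "good", "bad", "bad"]
predicted = ["good", "good", "bad", "bad", "good"]
result = evaluate(actual, predicted)
```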

Notes

  • Parsing collects the records into a set, which is an unordered data structure. Therefore, if several best splits have the same gain, different executions may produce different trees (and sometimes the same one).
  • It is assumed that the class label is the last value on each line of the .csv file.
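
The two notes above can be sketched as follows. This is a hypothetical variant, not the code in main.py: it reads records into a list instead of a set, treats the last value of each row as the class label, and breaks ties between equally good splits by a fixed rule so that repeated runs pick the same one. The names `load_records` and `pick_split`, the presence of a header row, and the sample data are all assumptions.

```python
import csv
import io


def load_records(csv_file):
    """Read rows as (features, label) pairs; the label is the last value.

    Assumption: the first row is a header with the attribute names.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    # A list keeps the file order (and duplicate rows), unlike a set.
    records = [(tuple(row[:-1]), row[-1]) for row in reader if row]
    return header[:-1], records


def pick_split(candidates):
    """Pick the best (gain, description) pair.

    Ties on gain are broken by the description string, so the result
    does not depend on the iteration order of the candidates.
    """
    return max(candidates, key=lambda c: (c[0], c[1]))


# Made-up sample in the assumed layout.
sample = io.StringIO(
    "credit_amount,age,class\n"
    "1768,21,good\n"
    "7882,34,bad\n"
)
attributes, records = load_records(sample)

# Two candidates share the best gain; the choice is still deterministic.
tied = {(0.2, "age <= 34.0"), (0.2, "credit_amount <= 2578.0"), (0.1, "age <= 21.0")}
chosen = pick_split(tied)
```

Sorting by a secondary key is only one possible tie-breaking rule. Any fixed rule gives reproducible trees.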