The goal of this project is to grow and prune binary-split classification trees using an implementation of the CART algorithm. See the problem set description for details.
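CART grows the tree greedily: at each node it picks the binary split that most reduces Gini impurity. The repository's actual split-search code is not shown here, so the following is only a minimal sketch of the impurity and gain computations (function names `gini` and `split_gain` are illustrative, not taken from main.py):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gain(parent, left, right):
    """Impurity reduction from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted
```

For example, a perfectly separating split of a 50/50 node has gain 0.5, the maximum possible for two classes.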
Technologies used in this project:
- Python 3.8 (Runs with 3.7)
- GitLab CI
- Git
To run, use the main.py file; there are no dependencies:
$ python3.8 main.py
Output
# Decision Tree #
(credit_history in {delayed previously, existing paid, critical/other existing credit})
├(T)─ (credit_amount <= 7882.0)
│ ├(T)─ (credit_history in {delayed previously, existing paid})
│ │ ├(T)─ (property_magnitude in {real estate})
│ │ │ ├(T)─ (credit_amount <= 1768.0)
│ │ │ │ ├(T)─ good
│ │ │ │ └(F)─ (age <= 21.0)
│ │ │ │   ├(T)─ bad
│ │ │ │   └(F)─ good
│ │ │ └(F)─ good
│ │ └(F)─ (age <= 34.0)
│ │   ├(T)─ (employment in {1<=X<4, 4<=X<7})
│ │   │ ├(T)─ good
│ │   │ └(F)─ (credit_amount <= 2578.0)
│ │   │   ├(T)─ (age <= 28.0)
│ │   │   │ ├(T)─ good
│ │   │   │ └(F)─ bad
│ │   │   └(F)─ (employment in {<1, unemployed})
│ │   │     ├(T)─ (property_magnitude in {real estate, no known property})
│ │   │     │ ├(T)─ good
│ │   │     │ └(F)─ bad
│ │   │     └(F)─ good
│ │   └(F)─ good
│ └(F)─ bad
└(F)─ bad
# Test Result #
Accuracy: 0.72
TP rate: 0.7345132743362832
TN rate: 0.5833333333333334
TP count: 166
TN count: 14
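The reported metrics follow from the standard confusion-matrix definitions. A sketch of the computation (the function name `binary_metrics` is illustrative; the FP and FN counts below are inferred from the reported rates, since TP rate = TP/(TP+FN) and TN rate = TN/(TN+FP)):

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy and per-class rates from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "tp_rate": tp / (tp + fn),  # sensitivity / recall
        "tn_rate": tn / (tn + fp),  # specificity
    }

# Counts consistent with the output above: 166/226 positives and
# 14/24 negatives classified correctly on a 250-record test set.
metrics = binary_metrics(tp=166, tn=14, fp=10, fn=60)
```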
- Parsing aggregates the records into a set, which is an unordered data structure. Therefore, when several candidate splits tie for the best gain, different executions may produce different (but equally good) trees.
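This nondeterminism stems from Python's string hash randomization (on by default since Python 3.3, controllable via `PYTHONHASHSEED`), which changes set iteration order between interpreter runs. A small illustrative sketch, not taken from main.py:

```python
# Candidate split attributes that tie for the best gain, held in a set.
candidates = {"credit_amount", "age", "employment"}

# max() with equal keys returns whichever element is iterated first,
# and set iteration order can differ between interpreter runs.
best = max(candidates, key=lambda attr: 0.12)  # all gains tied

# Sorting the candidates first makes the tie-break reproducible.
best_stable = max(sorted(candidates), key=lambda attr: 0.12)
```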
- It is assumed that the last value of each record in the .csv file is the class label.
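Under that assumption, loading the data amounts to splitting each row into attributes and a trailing label. A minimal sketch (the function name `load_records` and the dict-based record layout are assumptions, not the repository's actual parser):

```python
import csv

def load_records(path):
    """Read a CSV whose last column holds the class label.

    Returns a list of (attributes, label) pairs, where attributes maps
    each header name to that row's value.
    """
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    return [(dict(zip(header[:-1], row[:-1])), row[-1]) for row in data]
```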