Working with the no-show appointments dataset (https://www.kaggle.com/joniarroba/noshowappointments), a dataset of patient appointment records, we attempt to predict whether a patient will show up for their appointment. Only about a quarter of the patients are no-shows, and in this repo we show that by generating additional synthetic no-shows, we can improve the performance of patient no-show classifiers. The results are below, followed by a sketch of the pipeline.
Original dataset results:
| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| catboost | CatBoost Classifier | 0.8026 | 0.7461 | 0.0778 | 0.5843 | 0.1372 | 0.0942 | 0.1582 | 14.866 |
| lightgbm | Light Gradient Boosting Machine | 0.8015 | 0.7433 | 0.0376 | 0.6444 | 0.0711 | 0.05 | 0.1204 | 39.915 |
| xgboost | Extreme Gradient Boosting | 0.8003 | 0.7431 | 0.092 | 0.5332 | 0.1569 | 0.1035 | 0.1567 | 6.864 |
| rf | Random Forest Classifier | 0.8022 | 0.7411 | 0.1601 | 0.5339 | 0.2463 | 0.169 | 0.21 | 4.068 |
| gbc | Gradient Boosting Classifier | 0.7984 | 0.7332 | 0.0067 | 0.6078 | 0.0132 | 0.0086 | 0.0463 | 3.843 |
| ada | Ada Boost Classifier | 0.7976 | 0.7282 | 0.0168 | 0.463 | 0.0323 | 0.0186 | 0.0557 | 0.924 |
| et | Extra Trees Classifier | 0.7905 | 0.726 | 0.1991 | 0.4573 | 0.2773 | 0.1765 | 0.1974 | 6.206 |
| lda | Linear Discriminant Analysis | 0.791 | 0.681 | 0.0436 | 0.3569 | 0.0776 | 0.0353 | 0.0613 | 1.368 |
| lr | Logistic Regression | 0.7954 | 0.6784 | 0.025 | 0.398 | 0.0471 | 0.0236 | 0.0552 | 5.518 |
| knn | K Neighbors Classifier | 0.7778 | 0.6744 | 0.2076 | 0.403 | 0.2739 | 0.1583 | 0.1705 | 9.431 |
| nb | Naive Bayes | 0.2345 | 0.5988 | 0.9611 | 0.204 | 0.3365 | 0.005 | 0.0206 | 0.087 |
| dt | Decision Tree Classifier | 0.7344 | 0.5862 | 0.3361 | 0.3404 | 0.3382 | 0.1721 | 0.1721 | 0.482 |
| qda | Quadratic Discriminant Analysis | 0.5235 | 0.5083 | 0.4827 | 0.2066 | 0.2772 | 0.0098 | 0.0144 | 7.784 |
| svm | SVM - Linear Kernel | 0.7981 | 0 | 0 | 0 | 0 | 0 | 0 | 0.218 |
| ridge | Ridge Classifier | 0.7976 | 0 | 0.0092 | 0.4512 | 0.018 | 0.01 | 0.0396 | 0.077 |
Results with synthetic dataset:
| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| catboost | CatBoost Classifier | 0.86 | 0.9333 | 0.7607 | 0.9283 | 0.7721 | 0.7213 | 0.7438 | 20.435 |
| xgboost | Extreme Gradient Boosting | 0.8552 | 0.93 | 0.7629 | 0.9094 | 0.7726 | 0.7116 | 0.7324 | 8.432 |
| rf | Random Forest Classifier | 0.8521 | 0.9292 | 0.7843 | 0.8887 | 0.7896 | 0.7052 | 0.7262 | 6.809 |
| lightgbm | Light Gradient Boosting Machine | 0.8534 | 0.9273 | 0.7582 | 0.9005 | 0.7707 | 0.7079 | 0.7257 | 1.061 |
| et | Extra Trees Classifier | 0.8393 | 0.9183 | 0.8018 | 0.8482 | 0.7943 | 0.6794 | 0.6974 | 11.339 |
| gbc | Gradient Boosting Classifier | 0.8092 | 0.9063 | 0.7941 | 0.8015 | 0.7782 | 0.619 | 0.6325 | 6.779 |
| knn | K Neighbors Classifier | 0.8018 | 0.8786 | 0.7211 | 0.8423 | 0.7494 | 0.6045 | 0.6188 | 22.901 |
| ada | Ada Boost Classifier | 0.7615 | 0.8589 | 0.776 | 0.7462 | 0.7521 | 0.5231 | 0.5322 | 1.469 |
| lr | Logistic Regression | 0.7455 | 0.8153 | 0.7272 | 0.7437 | 0.7262 | 0.4913 | 0.4969 | 5.689 |
| lda | Linear Discriminant Analysis | 0.7457 | 0.8148 | 0.7273 | 0.7446 | 0.727 | 0.4918 | 0.4973 | 2.22 |
| dt | Decision Tree Classifier | 0.8098 | 0.8101 | 0.796 | 0.8 | 0.7767 | 0.6201 | 0.6349 | 0.719 |
| nb | Naive Bayes | 0.5981 | 0.7097 | 0.2909 | 0.7307 | 0.4108 | 0.1992 | 0.2414 | 0.125 |
| qda | Quadratic Discriminant Analysis | 0.5032 | 0.5046 | 0.2655 | 0.5167 | 0.3194 | 0.0092 | 0.0121 | 2.023 |
| svm | SVM - Linear Kernel | 0.7413 | 0 | 0.7165 | 0.745 | 0.7236 | 0.483 | 0.4875 | 0.374 |
| ridge | Ridge Classifier | 0.7457 | 0 | 0.7273 | 0.7446 | 0.727 | 0.4918 | 0.4973 | 0.088 |
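For reference, here is a minimal sketch of the pipeline described above. The table columns match the leaderboard printed by PyCaret's `compare_models`, so this sketch assumes that tooling; the repo does not spell out its synthetic-data generator here, so imbalanced-learn's SMOTE stands in for it, and the file name `KaggleV2-May-2016.csv` and the feature list are assumptions taken from the Kaggle dataset page, not necessarily what the repo uses.

```python
# Sketch: oversample the no-show class, then benchmark classifiers with
# PyCaret. SMOTE is an illustrative stand-in; the repo's actual
# synthetic-data generator may differ.
import pandas as pd
from imblearn.over_sampling import SMOTE
from pycaret.classification import setup, compare_models

# KaggleV2-May-2016.csv is the file shipped with the Kaggle dataset.
df = pd.read_csv("KaggleV2-May-2016.csv")
df["no_show"] = (df["No-show"] == "Yes").astype(int)  # ~25% positives

# A handful of numeric columns, chosen purely for illustration
# ("Hipertension" is spelled that way in the source data).
features = ["Age", "Scholarship", "Hipertension", "Diabetes",
            "Alcoholism", "SMS_received"]
X, y = df[features], df["no_show"]

# Generate synthetic minority-class (no-show) samples up to a 50/50 split.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)

# Hand the balanced frame to PyCaret and rank the same model zoo as above.
balanced = X_bal.assign(no_show=y_bal)
setup(data=balanced, target="no_show", session_id=42)
best = compare_models()  # prints the Accuracy/AUC/Recall/... leaderboard
```

One note on reading the tables: PyCaret reports AUC as 0 for models that expose no probability estimates (here `svm` and `ridge`), so those zeros are a reporting artifact rather than genuinely zero discrimination.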