A significant issue in medical settings is patients failing to attend scheduled doctor appointments despite receiving instructions (no-shows). Our client, a medical ERP solutions provider, seeks to tackle this by introducing a machine learning model into their software. The model predicts patient attendance, enabling medical providers to optimize appointment management.
The dataset used in this project, sourced from Kaggle, comprises appointment records from medical institutions, capturing various attributes of patients and their appointments. Key features include:
- Patient demographics: age and gender.
- Health characteristics: the presence of conditions such as diabetes or hypertension.
- Appointment-specific details: scheduled and appointment dates, and whether the patient received a reminder SMS.
- Target: a binary indicator of whether the patient attended or missed the appointment.
| No | Column Name | Description |
|----|---------------|-------------|
| 01 | PatientId | Identifier of a patient. |
| 02 | AppointmentID | Identifier of each appointment. |
| 03 | Gender | Male or female. Females make up the larger proportion of records. |
| 04 | ScheduledDay | The day the appointment was registered, which is always before the appointment itself. |
| 05 | AppointmentDay | The day of the actual appointment, when the patient has to visit the doctor. |
| 06 | Age | The patient's age. |
| 07 | Neighbourhood | Where the appointment takes place. |
| 08 | Scholarship | True or false. Indicates whether the patient is enrolled in the Brazilian welfare program Bolsa Família. |
| 09 | Hipertension | True or false. Indicates whether the patient has hypertension. |
| 10 | Diabetes | True or false. Indicates whether the patient has diabetes. |
| 11 | Alcoholism | True or false. Indicates whether the patient is an alcoholic. |
| 12 | Handcap | True or false. Indicates whether the patient is handicapped. |
| 13 | SMS_received | True or false. Indicates whether one or more reminder messages were sent to the patient. |
| 14 | No-show | True or false (target variable). Indicates whether the patient missed their appointment. |
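Since `ScheduledDay` and `AppointmentDay` arrive as timestamps, a common first step is to parse them and derive the lead time between scheduling and the appointment, along with a numeric target. A minimal pandas sketch (the two example rows and the `WaitingDays`/`NoShow` names are illustrative, not taken from the project code):

```python
import pandas as pd

# Two toy records mirroring the dataset's columns (illustrative values only)
df = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-27T08:36:51Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-05-02T00:00:00Z"],
    "No-show": ["No", "Yes"],
})

# Parse the timestamp columns and derive the lead time in whole days
df["ScheduledDay"] = pd.to_datetime(df["ScheduledDay"])
df["AppointmentDay"] = pd.to_datetime(df["AppointmentDay"])
df["WaitingDays"] = (df["AppointmentDay"].dt.normalize()
                     - df["ScheduledDay"].dt.normalize()).dt.days

# Encode the target: 1 = patient missed the appointment
df["NoShow"] = (df["No-show"] == "Yes").astype(int)

print(df[["WaitingDays", "NoShow"]])
```

Normalizing both timestamps to midnight avoids negative intervals for same-day appointments, since `AppointmentDay` carries no time-of-day component.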
The approach to solving the challenge of predicting patient no-shows involved a comprehensive workflow, focusing on both the development of a predictive model and its practical application within an existing system. Here's an overview of the approach taken:
- Data Storage and Initial Analysis: Utilized Snowflake for secure data storage and conducted exploratory data analysis (EDA) to understand the dataset's characteristics and identify potentially predictive features.
- Data Loading and Preprocessing: Loaded the data from Snowflake, then handled missing values, encoded categorical variables, and normalized features to prepare the dataset for modeling.
- Feature Engineering and Selection: Engineered meaningful features from the raw data, such as the time interval between the scheduling and appointment dates. Features were selected by importance, as determined with logistic regression and decision tree models, retaining those with at least 1% importance in either model.
- Dataset and Model Selection: Evaluated several algorithms, including Logistic Regression, Decision Tree, Random Forest, and XGBoost, across the original, upsampled, downsampled, and SMOTE-enhanced datasets. XGBoost trained on the original dataset emerged as the best performer once the `scale_pos_weight` parameter was adjusted to address class imbalance.
- API Development for Model Deployment: Created an API for the model, facilitating its integration into the client's ERP system. This involved deploying the model to AWS SageMaker, setting up an AWS Lambda function for model invocation, and configuring an Amazon API Gateway to expose the model as a RESTful service.
- Testing and Validation: Conducted thorough testing of the deployed model using Postman, validating its functionality and ensuring its readiness for real-world application.
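For the class-imbalance adjustment, XGBoost's `scale_pos_weight` is conventionally set to the ratio of negative to positive examples, so misclassifying the minority (no-show) class costs more during training. A minimal sketch with a hypothetical label vector (the actual ratio used in the project may differ):

```python
import numpy as np

# Hypothetical label vector: 1 = no-show (minority), 0 = attended
y = np.array([0] * 80 + [1] * 20)

# Conventional heuristic: scale_pos_weight = (negative count) / (positive count)
neg, pos = np.bincount(y)
scale_pos_weight = neg / pos
print(scale_pos_weight)  # 80 / 20 = 4.0

# The value would then be passed to the classifier, e.g.:
# model = xgboost.XGBClassifier(scale_pos_weight=scale_pos_weight)
```

This keeps the original data distribution intact, unlike upsampling or SMOTE, which is consistent with the finding that the original dataset performed best.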
The project is organized into several directories and files, each serving a specific purpose in the development, deployment, and documentation of the machine learning model.
Below is an overview of the project structure and the contents of each component:
```
Medical-Appointment-No-Show-Prediction
├── data/
│   ├── input/                   # Raw data files.
│   ├── processed/               # Cleaned and preprocessed data files.
│   ├── output/                  # Output data files, including model predictions.
│   ├── features/                # Important features used for filtering the data.
│   └── hyperparameters/         # Best hyperparameters obtained from Hyperopt tuning.
├── src/
│   ├── data_loader.py           # Script for loading and preprocessing data.
│   ├── preprocessing.py         # Data preprocessing functions.
│   ├── feature_engineering.py   # Feature engineering tasks.
│   ├── modeling.py              # Model training, evaluation, and prediction functions.
│   ├── train.py                 # Main script for training the model.
│   ├── predict.py               # Script for making predictions with the trained model.
│   ├── requirements.txt         # Python dependencies required for the project.
│   └── snowflake_creds.py       # Credentials for Snowflake database access.
├── model/                       # Trained model files and artifacts.
├── deployment_assets/           # Files and scripts used for deploying the model.
├── Snowflake_assets/            # Original data file for database creation and SQL queries for exploratory analysis.
├── Project Notebook.ipynb       # Jupyter notebook detailing the model development process.
├── Project Documentation.pdf    # Comprehensive documentation of the project.
└── Model Deployment.ipynb       # Jupyter notebook detailing the model deployment process.
```
- Clone the Repository

  Clone the project repository to your local machine.

  ```shell
  git clone https://github.com/TimKong21/Medical-Appointment-No-Show-Prediction.git
  cd Medical-Appointment-No-Show-Prediction
  ```

- Set Up a Virtual Environment

  Create and activate a virtual environment to manage the project's dependencies.

  ```shell
  # Create a virtual environment
  python -m venv env

  # Activate it on Windows
  env\Scripts\activate

  # Activate it on macOS/Linux
  source env/bin/activate
  ```

- Install Dependencies

  Install the required Python dependencies.

  ```shell
  pip install -r src/requirements.txt
  ```

- Model Training

  Train the model and make predictions.

  ```shell
  cd src
  python train.py
  python predict.py
  ```

- Model Deployment

  To deploy the model to AWS SageMaker and set up the AWS services needed for model invocation and API exposure, follow steps 1 to 6 in `Model Deployment.ipynb`. The notebook walks through deploying the model to SageMaker, creating an AWS Lambda function, and configuring an Amazon API Gateway to expose the model as a RESTful service.

- Testing and Validating with Postman

  After deployment, follow step 7 in `Model Deployment.ipynb` to test and validate the model's functionality using Postman. This involves sending requests to the deployed model's API endpoint and verifying the responses to ensure the model operates as expected.
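Outside Postman, the same check can be scripted. The sketch below only builds and prints the JSON request body; the endpoint URL, feature names, and response shape are placeholders, since the real values come from your own API Gateway deployment:

```python
import json

# Hypothetical feature payload; names echo the dataset's columns plus an
# engineered lead-time feature (illustrative, not the project's exact schema)
payload = {
    "Age": 35,
    "Scholarship": 0,
    "Hipertension": 1,
    "SMS_received": 1,
    "WaitingDays": 5,  # days between scheduling and the appointment
}

body = json.dumps(payload)
print(body)

# Sending it to the deployed endpoint would look like (URL is a placeholder):
# import requests
# resp = requests.post(
#     "https://<api-id>.execute-api.<region>.amazonaws.com/prod/predict",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# print(resp.json())
```

Verifying the response then mirrors the Postman workflow: check the HTTP status code and that the returned prediction falls in the expected range.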
For a comprehensive understanding of the project, refer to:
- `Project Notebook.ipynb` for the detailed model development process.
- `Model Deployment.ipynb` for the detailed model deployment process.
- `Project Documentation.pdf` for comprehensive project documentation.