The objective of this project is to predict, 30 days in advance, whether a patient is eligible to take a "Target Drug." This involves creating a positive set for model development and implementing feature engineering techniques for model building.
Before you begin, you will need to have a few tools and libraries installed on your machine:
- Python 3.7 or higher.
- Sklearn, XGBoost, Pandas
- The dataset comprises records for approximately 27,033 unique patients and 57 different types of incidents, spanning over five and a half years.
- The patient with patient_id
a0ddfd2c-1c7c-11ec-876d-16262ee38c7f
was involved in the most number of incidents (1645). - The most frequent incident recorded was the administration of
DRUG_TYPE_6
(549,616 occurrences). - The incidents are recorded between
2015-04-07
and2020-09-03
.
- Brainstormed ideas and devised a plan for tackling the problem statement.
- Gained a comprehensive understanding of the dataset, noting any potential issues such as duplication and inconsistencies.
- Addressed data duplication and index-related issues.
- Explored the dataset by examining unique value counts and descriptive statistics to gain insights.
- Cleverly separated the incidents related to patients who took the "Target Drug" by considering the 30-day prediction window.
- Merged the segregated set with the original dataset to handle both positive and negative sets effectively.
- Leveraged Pandas groupby(), size(), and unstack() methods to engineer 57 feature columns.
- Created a binary classification target variable by marking patients who had taken the "Target Drug."
- Addressed data scaling by using a StandardScaler to prepare the data for machine learning algorithms.
- Detected class imbalance and used an Oversampler to address the imbalance issue.
- Explored various machine learning models and selected the GradientBoostingClassifier.
- Performed hyperparameter tuning using GridSearchCV to find the best model parameters.
- Evaluated the model's performance on a validation set and got 80% accuracy with the best possible algorithm and best possible parameters.
- Processed the test dataset similarly to the training data.
- Used the trained model to predict labels for the test dataset.
- Attempted to improve model performance using PCA and deep learning techniques but with limited success.
- Explained the challenges faced during the model development process in the notebook.
- Explore more advanced feature engineering techniques.
- Investigate other machine learning algorithms and deep learning architectures.
- Experiment with different strategies for addressing class imbalance.
If you have any questions, comments, or suggestions for the project, please feel free to contact me at [nirmal.works@outlook.com]