Are you going? I am not sure

... Let's think about it

MedCamp is a non-profit that provides health camps for people in various cities across America. MedCamp was struggling to maintain its operations due to excessive operational expenses.

Goal 1

Help MedCamp reduce wasteful spending AND maintain quality experiences for attendees by accurately predicting who will and will not attend a health-fair event.

Goal 2

Conduct AB testing to determine which model would have performed best in 'real time'

Data

This project, Healthcare Analytics, came from Kaggle

Anonymized Features: All data was anonymized

Protecting patient data is critical. However, it does make following this README more difficult. I will reorient the reader throughout!

Kaggle Description

train.zip contains 6 different csv files apart from the data dictionary as described below:

Health_Camp_Detail.csv – File containing HealthCampId, CampStartDate, CampEndDate and Category details of each camp.

Health_Camp_ID	Camp_Start_Date	Camp_End_Date	Category1	Category2	Category3
6560	16-Aug-03	20-Aug-03	First	B	2
6530	16-Aug-03	28-Oct-03	First	C	2
6544	03-Nov-03	15-Nov-03	First	F	1
6585	22-Nov-03	05-Dec-03	First	E	2
6561	30-Nov-03	18-Dec-03	First	E	1

Train.csv & Test.csv – Both files have similar layouts and contain registration details for the camps: PatientID, HealthCampID, RegistrationDate and a few anonymized variables as of the registration date. Test.csv contains registration details for all the camps done after 1st April 2006.

	Patient_ID	Health_Camp_ID	Registration_Date	Var1	Var2	Var5
0	489652	6578	10-Sep-05	4	0	2
1	507246	6578	18-Aug-05	45	5	7
2	523729	6534	29-Apr-06	0	0	0
3	524931	6535	07-Feb-04	0	0	0
4	521364	6529	28-Feb-06	15	1	7

Patient_Profile.csv – This file contains Patient profile details like PatientID, OnlineFollower, Social media details, Income, Education, Age, FirstInteractionDate, CityType and EmployerCategory

	Patient_ID	Income	Education_Score	Age	First_Interaction	City_Type	Employer_Category
0	516956	1	90	39	18-Jun-03		Software Industry
1	507733	1	None	40	20-Jul-03	H	Software Industry
2	508307	3	87	46	02-Nov-02	D	BFSI
3	512612	1	75	47	02-Nov-02	D	Education
4	521075	3	None	80	24-Nov-02	H	Others

First_Health_Camp_Attended.csv & Second_Health_Camp_Attended.csv – These files contain details about people who attended health camps of the first and second formats. This includes Donation (amount) & Health_Score of the person.

	Patient_ID	Health_Camp_ID	Donation	Health_Score
0	506181	6560	40	0.43902439
1	494977	6560	20	0.097560976
2	518680	6560	10	0.048780488
3	509916	6560	30	0.634146341
4	488006	6560	20	0.024390244

Third_Health_Camp_Attended.csv - This file contains details about people who attended health camp of third format. This includes Numberofstallvisited & LastStallVisitedNumber.

Patient_ID	Health_Camp_ID	Number_of_stall_visited	Last_Stall_Visited_Number
517875	6527	3	1
504692	6578	1	1
504692	6527	3	1
493167	6527	4	4
510954	6528	2	2

EDA

Classes were imbalanced among potential health camp attendees. Attendance rates differed across geographic locations and also among camps held within the same location. Thus, simply estimating attendance from a global or local historical rate would lead to poor results. Additionally, it is important to note that some patients attended more than one MedCamp health event.

37,633	Unique Patient IDs
65	Unique Health Camps
20,534	Count of Patients Attending a Health Camp
15,011	Unique Patients Attending at least one Health Camp
102,000	Patient-Event Registrations
~ 20%	Global Attendance Rate
3	Classes or Types of Health Camps
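To make the difference between a global rate and per-camp rates concrete, here is a minimal sketch. The rows and the 0/1 `attended` flag are made up for illustration; they are not MedCamp data.

```python
import pandas as pd

# Hypothetical registrations with a 0/1 attended flag (illustrative only).
regs = pd.DataFrame({
    "Health_Camp_ID": [6578, 6578, 6578, 6578, 6527, 6527],
    "Patient_ID": [1, 2, 3, 4, 1, 5],
    "attended": [1, 0, 0, 0, 1, 1],
})

# One number for the whole data set...
global_rate = regs["attended"].mean()

# ...versus one rate per camp, which can differ a lot from the global rate.
per_camp_rate = regs.groupby("Health_Camp_ID")["attended"].mean()

print(global_rate)
print(per_camp_rate)
```

In this toy data the global rate sits between two very different camp rates, which is the situation described above.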

Creating Target Variable Y

According to the description on Kaggle, MedCamp wanted to know the probability that a patient would successfully attend a health-fair event. For the first two camp types, success was defined as getting a health score. For the third event type, success was visiting at least one booth. The data from MedCamp spanned several years, and preliminary EDA showed that each patient could attend more than one camp. Thus, to correctly create a target feature I needed to know the Camp ID, the Patient ID, and whether the patient successfully attended that event.
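The definition of success above can be sketched in pandas roughly as follows. The column names follow the sample tables in this README; the tiny dataframes, their IDs, and the helper names (`scored`, `stalls`, `labeled`) are hypothetical stand-ins, not the actual project code.

```python
import pandas as pd

keys = ["Patient_ID", "Health_Camp_ID"]

# Hypothetical miniature versions of the attendance files.
first = pd.DataFrame({"Patient_ID": [101], "Health_Camp_ID": [1], "Health_Score": [0.44]})
second = pd.DataFrame({"Patient_ID": [102], "Health_Camp_ID": [2], "Health_Score": [0.10]})
third = pd.DataFrame({
    "Patient_ID": [103, 104],
    "Health_Camp_ID": [3, 3],
    "Number_of_stall_visited": [3, 0],
})
registrations = pd.DataFrame({
    "Patient_ID": [101, 102, 103, 104, 105],
    "Health_Camp_ID": [1, 2, 3, 3, 1],
})

# First two formats: success means a health score was recorded.
scored = pd.concat([first, second])
scored = scored.loc[scored["Health_Score"].notna(), keys]

# Third format: success means at least one stall was visited.
stalls = third.loc[third["Number_of_stall_visited"] > 0, keys]

attended = pd.concat([scored, stalls]).drop_duplicates().assign(y_target=1)

# Every registration without a matching success row gets y_target = 0.
labeled = registrations.merge(attended, on=keys, how="left")
labeled["y_target"] = labeled["y_target"].fillna(0).astype(int)
print(labeled)
```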

Primary Key

Given that each patient could attend more than one event, it was necessary to create a primary key for each Patient & Health Camp combination by concatenating the Patient ID and the Camp ID.

Health Camp ID 6578	Patient ID 489652	Primary Key 4896526578

Creating this primary key was helpful in combining information and creating additional time features; meaningful data was spread among several csv files.
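A minimal sketch of building that key with pandas, assuming the IDs are stored as integers. The column name `primary_key` is my own choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Patient_ID": [489652, 507246],
    "Health_Camp_ID": [6578, 6578],
})

# Concatenate as strings so the digits are joined rather than added.
df["primary_key"] = df["Patient_ID"].astype(str) + df["Health_Camp_ID"].astype(str)
print(df)
```

Plain concatenation is only unambiguous when the IDs have a fixed number of digits, which holds for the samples shown in this README (6-digit patient IDs, 4-digit camp IDs). If that were not guaranteed, a separator such as `_` between the two parts would be a safer choice.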

The need for Feature Engineering

Training the model with only the five anonymized features results in very poor performance.

The two anonymized features with the highest feature weights were Var1 and Var5. Interestingly, however, most patients had a zero value for these two features. Without knowing what 'Var1' is, and given that only a few thousand patients had non-zero values, I decided not to drop or edit these features for modeling purposes. There is simply not enough context to apply domain knowledge to the features Var1 - Var5.

Thus, feature engineering was instrumental in improving the model.

One Hot Encoding & Imputing

Categorical Features and Imputation

The categorical features include City & Job Type. The binary categorical features indicated whether a patient shared their health fair attendance online through Twitter, LinkedIn, or Facebook, or was an online follower of MedCamp.

Most patients had many missing values for Job Type and other numerical features (discussed later). To avoid collinearity, I imputed 9999.0 for the missing values in the Job column.

Var1	Var5	y_target	Camp Start Date - Registration Date	Registration Date - First Interaction	Camp Start Date - First Interaction	Camp End Date - Registration Date	Camp Length	23384	Third	F	G	2100	9999.0
4.0	2.0	1.0	-25.0	278	253	34	59	1	1	0	1	1	1
0.0	0.0	0.0	-24.0	99	75	161	185	1	0	1	0	1	1
4.0	2.0	0.0	-60.0	355	295	711	771	1	0	1	0	1	1
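The sentinel imputation and one-hot encoding described above might look something like this. It is a sketch under assumptions: the column name `Employer_Category` comes from the Patient_Profile sample, the rows are made up, and the sentinel is written as the string "9999.0" here so the column keeps a single type.

```python
import pandas as pd

df = pd.DataFrame({
    "Employer_Category": ["Software Industry", None, "BFSI"],
})

# Give missing job types their own category instead of dropping the rows.
MISSING = "9999.0"
df["Employer_Category"] = df["Employer_Category"].fillna(MISSING)

# One 0/1 column per category, including the "missing" sentinel.
encoded = pd.get_dummies(df, columns=["Employer_Category"], dtype=int)
print(encoded)
```

Treating "missing" as its own level keeps every patient in the training set while still letting the model learn from the fact that the value was absent.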

Nearly 23,500 patients were missing the Camp Location. However, I was able to use the primary key to link a patient with a camp. Then, using sets, I was able to confirm that each camp ID is only associated with a certain city value by checking for unions and intersections among CampIDs, PatientIDs and Camp Length that were spread among the csv files. Thus, I was able to backtrack and impute missing city values for each patient which did improve prediction scores.
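One way to express that backtracking step in pandas is shown below. This is a sketch, not the set-based code used in the project: the `city` column and the rows are hypothetical, and a `groupby` check stands in for the union/intersection checks.

```python
import pandas as pd

df = pd.DataFrame({
    "Health_Camp_ID": [6578, 6578, 6527, 6527, 6527],
    "city": ["H", None, "D", "D", None],
})

known = df.dropna(subset=["city"])

# Confirm that every camp maps to exactly one city before imputing.
cities_per_camp = known.groupby("Health_Camp_ID")["city"].nunique()
assert (cities_per_camp == 1).all()

# Build the camp -> city lookup and fill only the missing values.
camp_to_city = known.drop_duplicates("Health_Camp_ID").set_index("Health_Camp_ID")["city"]
df["city"] = df["city"].fillna(df["Health_Camp_ID"].map(camp_to_city))
print(df)
```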

Numerical Features and Imputation

The numerical features provided by MedCamp were missing for most patients. For example, Age, Income and Education Score had fewer than 2,000 values each. Given that 94% of the patients were missing all three values, imputing average values onto the other 35,000 patients for any numerical feature would be meaningless and create collinearity.

Features from Dates

I used the primary key to track the unique patient events and consolidate important information into a single csv that could be used for training and testing.

The following features were created:

	Feature Name (Days)
	Registration Date - First Interaction
	Camp Start Date - Registration Date
	Camp Start Date - First Interaction
	Camp End Date - Registration Date
	Camp Length
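The features listed above are day differences between pairs of dates. A minimal sketch, assuming the dates have already been merged into one row per patient-camp pair and use the `dd-Mon-yy` format seen in the sample tables. The dates below are hypothetical, chosen only to illustrate the arithmetic.

```python
import pandas as pd

df = pd.DataFrame({
    "Registration_Date": ["10-Sep-05"],
    "First_Interaction": ["06-Dec-04"],
    "Camp_Start_Date": ["16-Aug-05"],
    "Camp_End_Date": ["14-Oct-05"],
})

# Parse the dd-Mon-yy strings into real datetimes.
for col in list(df.columns):
    df[col] = pd.to_datetime(df[col], format="%d-%b-%y")

def days_between(later, earlier):
    """Whole days from `earlier` to `later` (negative if `later` comes first)."""
    return (df[later] - df[earlier]).dt.days

df["Registration Date - First Interaction"] = days_between("Registration_Date", "First_Interaction")
df["Camp Start Date - Registration Date"] = days_between("Camp_Start_Date", "Registration_Date")
df["Camp Start Date - First Interaction"] = days_between("Camp_Start_Date", "First_Interaction")
df["Camp End Date - Registration Date"] = days_between("Camp_End_Date", "Registration_Date")
df["Camp Length"] = days_between("Camp_End_Date", "Camp_Start_Date")
print(df.iloc[0, 4:])
```

A negative "Camp Start Date - Registration Date" simply means the patient registered after the camp had already started.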

Modeling

Because the goal is to give every patient an individualized health experience, specific supplies have to be on hand for each attendee. Accurate predictions mean we can be confident of having the correct supplies and of accomplishing the goal of improving health through individualized interventions.

Results after creating features, one-hot encoding, scaling

As shown above, all models achieved a similar ROC score. However, when the numbers of false negatives and false positives are taken into consideration, the XGBoost model is the best choice.

The date features ended up improving scores for all models. Additionally, for all but some iterations of Random Forests, the date features appeared among the top feature importances.

Post-Hoc

The global attendance rate was 20%. The training and validation attendance rate was 27%. However, 5 of the 10 camp locations had an attendance rate between 32.2% and 33.8%, and the highest attendance rate was just over 70%. This high level of variance helps explain why weighting classes by the exact global attendance rate, when dealing with class imbalance, caused the models to perform worse than the standard balanced class option. However, models did perform best with a slight weighting of classes: 0.4 for attends and 0.6 for non-attends.
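In scikit-learn, a manual weighting like the one described above is passed through the `class_weight` argument. The sketch below uses `LogisticRegression` and synthetic data purely as stand-ins; it does not reproduce the project's models or features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced toy data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=500) > 0.8).astype(int)

# 1 = attends, 0 = does not attend; weights as described in the text.
model = LogisticRegression(class_weight={1: 0.4, 0: 0.6})
model.fit(X, y)

# Probability of attending for each row.
proba = model.predict_proba(X)[:, 1]
print(proba[:5])
```

Swapping the dictionary for `class_weight="balanced"` gives the standard balanced option mentioned above, so the two settings can be compared with a one-line change.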

There was much diversity BOTH within & among Health Camp attendance rates as it pertains to:

1. The size of the Health Camp.

2. Among groups of the same size

3. Camp Location

4. Among different camps at the Same Location

There is a correlation and outlier among Health Camp Attendance Rates:

Model Specifics: Disagreement on 'Which Patients will attend'

I created a new dataframe which contains the prediction and probability results for three of the models used in this project.

Y_target_SUM is the total 'score': the sum of predicted attendance (0 or 1) across all models plus y_target. Top value = 4.

Y_count_allModels is the sum of the predicted attendance values (0 or 1) across the three models being analyzed here. Top value = 3.

	Unnamed: 0	Camp Start Date - Registration Date	Registration Date - First Interaction	Camp Start Date - First Interaction	Camp End Date - Registration Date	Camp Length	Second	Third	A	F	G	2100	9999.0	2517	23384	Patient_ID	prediction	Proba	proba_kNN	proba_sVC	prediction_sVC	proba_xg	Y_count_allModels	Y_target_SUM
0	0	-119.0	14	-105	66	185	0	0	0	1	0	1	1	1	0	514789	0.0	0.2775604	0.0	0.1561240540625182	0.0	0.28253844	0.0	0.0
1	1	-410.0	559	149	361	771	0	0	0	1	0	1	1	0	1	508149	0.0	0.24640098	0.1	0.1557039638243477	0.0	0.23559786	0.0	0.0
2	2	-76.0	262	186	113	189	0	0	0	1	0	1	0	1	0	492650	0.0	0.33918345	0.2	0.1573992543873874	0.0	0.34879157	0.0	0.0
3	3	53.0	107	160	57	4	1	0	1	0	0	1	1	0	1	511274	0.0	0.4557019	0.3	0.1542467704410321	0.0	0.43797377	0.0	0.0
4	4	19.0	11	30	58	39	0	1	0	0	1	1	1	0	1	520795	1.0	0.5163712	0.4	0.5654859255622244	1.0	0.4760046	1.0	1.0
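The two score columns can be computed with a row-wise sum. In this sketch the prediction column names (`pred_model_a`, etc.) and the rows are hypothetical; only `y_target`, `Y_count_allModels` and `Y_target_SUM` are names taken from this README.

```python
import pandas as pd

results = pd.DataFrame({
    "y_target": [0, 1, 1, 0],
    "pred_model_a": [0, 1, 0, 1],
    "pred_model_b": [0, 1, 0, 1],
    "pred_model_c": [0, 1, 1, 1],
})
pred_cols = ["pred_model_a", "pred_model_b", "pred_model_c"]

# How many of the three models predicted attendance (0 to 3).
results["Y_count_allModels"] = results[pred_cols].sum(axis=1)

# Model votes plus the true label (0 to 4).
results["Y_target_SUM"] = results["Y_count_allModels"] + results["y_target"]
print(results)
```

Rows where `Y_count_allModels` is neither 0 nor 3 are the ones where the models disagree with each other.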

Upon closer examination, there is disagreement among the models about which patients will attend a health event. It might be possible to gain useful insight by examining interesting patients: those on which the models agreed, those on which they disagreed, False Positives, False Negatives, etc.

Below are plots showing the following for each Patient:

A. The probability each model assigned to a patient (y-axis)
B. If the patient actually attended (shown by color)
C. The overall score group for that patient (shown by the respective column the dot appears in)

Experimentation with Tensorflow & Keras

Since the features are anonymized and not descriptive, black-box models are acceptable here, and an optimized neural network may produce good results. I used TensorFlow and Keras and was able to achieve results similar to the other models with minimal training.

However, I am confident that these scores can improve by using a grid search and other optimization techniques.

AB Testing

*Under construction

*While AB testing is not the most direct method to compare models IT IS a valuable learning experience for implementing production code.

My plan is to conduct a mock analysis of 'model predictions' had they been actually implemented. Essentially, 'What if' MedCamp used previous camp data to train models for each camp individually and sequentially?

Which model (SVC, Logistic Regression, KNN) would perform best as a bandit ?!?

Steps in Experiment:

1. Put camps in order by end date.

2A. Remove overlap: if Camps A, B & C end before Camp D, the patient data from Camps A, B & C would be used to train the models [SVC, Logistic Regression, KNN] to predict Camp D's patient attendance.

2B. However, if Camp C starts before D but does not end before D starts, Camp C's results can't be used to train the bandits [SVC, Logistic Regression, KNN].

3. Append model results to data frame.
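The camp-selection rule in the steps above can be sketched as a simple date filter. The camp names and dates here are hypothetical, and `training_camps` is a helper name I made up for illustration.

```python
import pandas as pd

# Hypothetical camps: C overlaps with D, so it must be excluded for D.
camps = pd.DataFrame({
    "camp": ["A", "B", "C", "D"],
    "start": pd.to_datetime(["2004-01-01", "2004-02-01", "2004-03-01", "2004-03-20"]),
    "end": pd.to_datetime(["2004-01-10", "2004-02-15", "2004-04-01", "2004-04-10"]),
}).sort_values("end").reset_index(drop=True)

def training_camps(target):
    """Camps that finished before `target` started, so their outcomes are known."""
    target_start = camps.loc[camps["camp"] == target, "start"].iloc[0]
    return camps.loc[camps["end"] < target_start, "camp"].tolist()

print(training_camps("D"))
```

Filtering on end date rather than start date is what prevents leakage: a camp that is still running when the target camp starts has no final attendance results to train on.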

Steps for modified Thompson Sampling:

I modified traditional Thompson Sampling in favor of a numerical solution. Rather than explicitly using the Beta, a small penalty (-0.5%) was imposed if a bandit was chosen AND the Beta was greater than the actual rate for that model. In a different experiment using different bandits, this method achieved a modest improvement over using the exact Beta. I plan on conducting the same experiment on this data set a few more times. Results coming soon.
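For reference, here is a sketch of the unmodified baseline (the "exact Beta" version): standard Beta-Bernoulli Thompson Sampling over three bandits. The penalty variant is deliberately not shown, because the description above does not pin down exactly where the penalty is applied. The win rates used to simulate outcomes are made up.

```python
import numpy as np

def thompson_sampling(outcomes, seed=0):
    """outcomes: dict of name -> 0/1 array, 1 = that model predicted patient t correctly."""
    rng = np.random.default_rng(seed)
    names = list(outcomes)
    n_rounds = len(outcomes[names[0]])
    wins = dict.fromkeys(names, 0)
    plays = dict.fromkeys(names, 0)
    for t in range(n_rounds):
        # Draw a plausible win rate for each bandit from its Beta posterior.
        draws = {n: rng.beta(wins[n] + 1, plays[n] - wins[n] + 1) for n in names}
        chosen = max(draws, key=draws.get)
        # Only the chosen bandit's result for this patient is observed.
        wins[chosen] += int(outcomes[chosen][t])
        plays[chosen] += 1
    return wins, plays

# Simulated correctness per patient (hypothetical win rates).
sim = np.random.default_rng(1)
outcomes = {
    "SVC": sim.random(2000) < 0.9,
    "KNN": sim.random(2000) < 0.3,
    "Logistic Regression": sim.random(2000) < 0.3,
}
wins, plays = thompson_sampling(outcomes)
print(plays)
```

With win rates this far apart, nearly all plays go to the strongest bandit. On the real camps the models are much closer, which is where a modification such as the penalty could matter.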

Below is a graph showing what Beta was chosen and how it compares to that bandit's current win rate

Results

Win = Correct prediction of a random patient's attendance for that camp.

Initial results are mixed, with some camps having improved prediction rates and some being worse. Three rounds of results are shown in the table below.

When I separated each camp and had the models predict patient attendance for just that camp, each model generally performed better than when I had used more data and trained them all at once. My next step will be to see how scores align with other features such as Camp Location and Camp Length. As indicated in the post-hoc above, there was much variation among camp attendance rates, and this may result in poor performance. The data may have been 'pulled' away from a better prediction vector by too much diversity, with not enough data within each group to create a normal distribution.

	camp_ID	Win Rate SVC	Win Rate KNN	Win Rate Logistic Regression	Camp Size (Number of Patients)	knn2	svc2	log2	knn3	svc3	log3	knn4	svc4	log4	knn5R	svc5R	log5R	knn6R	svc6R	log6R
0	6578	0.345	0.336	0.362	2835	0.361	0.348	0.349	0.366	0.335	0.338	0.362	0.335	0.345	0.359	0.309	0.35	0.357	0.318	0.361
1	6532	0.874	0.861	0.863	1991	0.5	0.5	0.667	0.5	0.6	0.5	0.843	0.861	0.886	0.871	0.866	0.862	0.849	0.875	0.873
2	6543	0.88	0.877	0.852	6541	0.883	0.865	0.857	0.885	0.864	0.856	0.878	0.881	0.837	0.868	0.868	0.884	0.874	0.879	0.862
3	6580	0.935	0.925	0.913	3515	0.921	0.934	0.925	0.92	0.934	0.919	0.926	0.924	0.929	0.93	0.909	0.932	0.924	0.933	0.919
4	6570	0.92	0.918	0.927	3562	0.913	0.925	0.926	0.932	0.915	0.908	0.918	0.921	0.926	0.923	0.922	0.918	0.92	0.924	0.92
5	6542	0.836	0.81	0.866	2366	0.85	0.828	0.857	0.845	0.842	0.853	0.858	0.845	0.805	0.855	0.825	0.843	0.842	0.851	0.843
6	6571	0.92	0.892	0.899	2084	0.911	0.897	0.908	0.902	0.906	0.908	0.906	0.903	0.907	0.913	0.886	0.911	0.911	0.904	0.898
7	6527	0.379	0.344	0.428	4142	0.413	0.41	0.384	0.406	0.32	0.43	0.295	0.42	0.3	0.413	0.321	0.412	0.395	0.407	0.413
8	6526	0.954	0.969	0.959	3807	0.97	0.966	0.944	0.965	0.965	0.958	0.966	0.965	0.953	0.968	0.953	0.966	0.957	0.966	0.967
9	6539	0.883	0.861	0.871	1990	0.333	0.75	0.5	0.5	0.5	0.667	0.846	0.884	0.886	0.885	0.868	0.846	0.879	0.87	0.858
10	6528	0.306	0.305	0.243	1742	0.5	0.667	0.5	0.5	0.667	0.5	0.313	0.283	0.271	0.294	0.28	0.315	0.293	0.3	0.305
11	6555	0.646	0.61	0.489	1736	0.5	0.333	0.5	0.5	0.333	0.5	0.6	0.602	0.634	0.612	0.633	0.569	0.602	0.625	0.615
12	6541	0.409	0.382	0.429	1545	0.5	0.333	0.5	0.5	0.333	0.5	0.401	0.373	0.432	0.406	0.416	0.412	0.395	0.425	0.388
13	6523	0.277	0.258	0.315	2082	0.303	0.3	0.29	0.29	0.309	0.264	0.301	0.306	0.25	0.284	0.303	0.312	0.297	0.288	0.308
14	6538	0.841	0.837	0.845	3952	0.839	0.843	0.843	0.839	0.845	0.84	0.848	0.824	0.842	0.841	0.832	0.849	0.844	0.841	0.839
15	6549	0.682	0.719	0.687	1833	0.75	0.667	0.5	0.5	0.667	0.75	0.718	0.679	0.682	0.688	0.721	0.69	0.697	0.699	0.703
16	6586	0.766	0.789	0.675	2622	0.789	0.723	0.769	0.798	0.745	0.681	0.782	0.687	0.796	0.766	0.78	0.765	0.775	0.781	0.74
17	6554	0.843	0.859	0.846	2301	0.844	0.857	0.848	0.839	0.863	0.835	0.844	0.845	0.862	0.843	0.858	0.85	0.856	0.842	0.853
18	6529	0.495	0.573	0.577	3821	0.581	0.556	0.51	0.564	0.566	0.574	0.565	0.575	0.566	0.574	0.566	0.551	0.572	0.567	0.56
19	6540	0.888	0.901	0.906	1424	0.5	0.5	0.4	0.667	0.333	0.333	0.893	0.897	0.905	0.912	0.885	0.888	0.888	0.881	0.917
20	6534	0.305	0.296	0.283	3595	0.303	0.287	0.285	0.28	0.303	0.306	0.297	0.291	0.297	0.286	0.318	0.287	0.288	0.3	0.296
21	6535	0.871	0.831	0.884	1880	0.5	0.75	0.667	0.8	0.5	0.5	0.843	0.861	0.888	0.852	0.887	0.842	0.847	0.859	0.884
22	6561	0.542	0.685	0.592	198	0.5	0.5	0.333	0.5	0.5	0.4	0.385	0.62	0.682	0.648	0.645	0.586	0.595	0.694	0.592
23	6585	0.735	0.787	0.752	1396	0.5	0.6	0.5	0.5	0.5	0.667	0.797	0.731	0.706	0.755	0.777	0.759	0.71	0.767	0.786
24	6536	0.565	0.57	0.495	2035	0.573	0.493	0.557	0.554	0.549	0.564	0.552	0.536	0.572	0.557	0.561	0.547	0.555	0.542	0.57
25	6562	0.963	0.959	0.957	2336	0.956	0.957	0.968	0.968	0.959	0.946	0.96	0.951	0.969	0.961	0.962	0.955	0.964	0.952	0.963
26	6537	0.881	0.864	0.877	3857	0.878	0.876	0.869	0.883	0.838	0.883	0.888	0.84	0.859	0.871	0.874	0.879	0.872	0.877	0.874
27	6581	0.943	0.935	0.938	1483	0.667	0.5	0.75	0.667	0.5	0.75	0.939	0.943	0.933	0.941	0.936	0.938	0.943	0.94	0.935
28	6524	0.655	0.538	0.667	147	0.667	0.667	0.333	0.5	0.667	0.5	0.633	0.658	0.538	0.656	0.655	0.531	0.516	0.605	0.682
29	6587	0.462	0.581	0.556	77	0.531	0.536	0.542	0.308	0.6	0.556	0.585	0.429	0.545	0.531	0.52	0.556	0.632	0.481	0.474
30	6557	0.35	0.3	0.294	50	0.375	0.227	0.368	0.125	0.316	0.367	0.3	0.28	0.364	0.267	0.364	0.3	0.375	0.348	0.278
31	6546	0.931	0.967	0.936	401	0.957	0.904	0.966	0.951	0.946	0.949	0.939	0.972	0.919	0.937	0.967	0.939	0.937	0.94	0.966
32	6569	0.374	0.357	0.296	175	0.5	0.2	0.5	0.333	0.5	0.25	0.268	0.403	0.273	0.1	0.38	0.375	0.354	0.355	0.375
33	6564	0.915	0.884	0.885	512	0.88	0.912	0.901	0.893	0.911	0.886	0.915	0.871	0.903	0.904	0.871	0.917	0.898	0.87	0.918
34	6575	0.36	0.32	0.533	88	0.467	0.387	0.441	0.413	0.333	0.475	0.321	0.5	0.459	0.45	0.447	0.25	0.447	0.333	0.481
35	6552	0.125	0.514	0.548	80	0.429	0.529	0.5	0.381	0.538	0.519	0.412	0.488	0.556	0.538	0.529	0.286	0.462	0.467	0.543
36	6558	0.583	0.526	0.5	42	0.72	0.462	0.273	0.588	0.455	0.571	0.412	0.643	0.611	0.556	0.529	0.571	0.652	0.5	0.429

Impacts from Randomizing

I edited the AB testing script to randomize the ordering of patients. The fear was that, by chance, a bandit could stumble upon a series of easy-to-predict patients (ones that all the models classify correctly), which would skew how accurate the bandit/model appears to be at predicting patient attendance.

I ran ANOVA comparisons between randomized and non-randomized orderings of patients for AB testing. All p-values were between .22 and .98.

There are no significant differences in win rates between models whose patients were randomly ordered and models that used the arbitrary order from the camp output.
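A comparison like this can be run with `scipy.stats.f_oneway`. The win rates below are hypothetical placeholders, not values from the results table, and the 0.05 threshold is an assumed convention.

```python
from scipy.stats import f_oneway

# Hypothetical per-camp win rates for one model under each ordering.
ordered = [0.84, 0.86, 0.89, 0.85, 0.87]
shuffled = [0.87, 0.87, 0.86, 0.85, 0.88]

# One-way ANOVA: do the two groups share the same mean win rate?
stat, p_value = f_oneway(ordered, shuffled)
significant = p_value < 0.05
print(stat, p_value, significant)
```

With only two groups, one-way ANOVA is equivalent to a two-sample t-test with equal variances, so either test would lead to the same conclusion here.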

Unique Patients

What are the Most unique Patients?

Due to the binary nature of classification, we need to break 'unique' into two groups:

Those patients who attend and No models predict they will attend
Those patients who do NOT attend but All models predict they will attend

There are 245 unique patients among 73,000 tested that would be considered the Most unique
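Selecting these two groups is a pair of boolean filters on the agreement score. A minimal sketch, assuming a results dataframe with `y_target` and `Y_count_allModels` columns as described earlier; the rows are made up.

```python
import pandas as pd

results = pd.DataFrame({
    "Patient_ID": [1, 2, 3, 4],
    "y_target": [1, 0, 1, 0],
    "Y_count_allModels": [0, 3, 3, 0],
})
n_models = 3

# Attended, but no model predicted attendance.
missed_by_all = results[(results["y_target"] == 1) & (results["Y_count_allModels"] == 0)]

# Did not attend, but every model predicted attendance.
predicted_by_all = results[(results["y_target"] == 0) & (results["Y_count_allModels"] == n_models)]

most_unique = pd.concat([missed_by_all, predicted_by_all])
print(most_unique)
```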

What are the next most unique Patients?