MedCamp is a non-profit that provides health camps for people in various cities across America. MedCamp was having challenges maintaining its operations due to excessive operational expenses.
Help MedCamp reduce wasteful spending AND maintain quality attendee experiences by accurately predicting who will and will not attend health-fair events.
Conduct AB testing to determine which model would have performed best in 'real time'
This project, Healthcare Analytics, came from Kaggle
Protecting patient data is critical. However, it does make following this README.md more difficult. I will reorient the reader throughout!
train.zip contains 6 different csv files apart from the data dictionary as described below:
Health_Camp_Detail.csv – File containing HealthCampId, CampStartDate, CampEndDate and Category details of each camp.
Health_Camp_ID | Camp_Start_Date | Camp_End_Date | Category1 | Category2 | Category3 |
---|---|---|---|---|---|
6560 | 16-Aug-03 | 20-Aug-03 | First | B | 2 |
6530 | 16-Aug-03 | 28-Oct-03 | First | C | 2 |
6544 | 03-Nov-03 | 15-Nov-03 | First | F | 1 |
6585 | 22-Nov-03 | 05-Dec-03 | First | E | 2 |
6561 | 30-Nov-03 | 18-Dec-03 | First | E | 1 |
Train.csv & Test.csv – Both files have similar layouts, containing registration details for the camps: Patient_ID, Health_Camp_ID, Registration_Date, and a few anonymized variables captured as of the registration date. Test.csv covers registrations for all camps held after 1st April 2006.
Patient_ID | Health_Camp_ID | Registration_Date | Var1 | Var2 | Var3 | Var4 | Var5 | |
---|---|---|---|---|---|---|---|---|
0 | 489652 | 6578 | 10-Sep-05 | 4 | 0 | 0 | 0 | 2 |
1 | 507246 | 6578 | 18-Aug-05 | 45 | 5 | 0 | 0 | 7 |
2 | 523729 | 6534 | 29-Apr-06 | 0 | 0 | 0 | 0 | 0 |
3 | 524931 | 6535 | 07-Feb-04 | 0 | 0 | 0 | 0 | 0 |
4 | 521364 | 6529 | 28-Feb-06 | 15 | 1 | 0 | 0 | 7 |
Patient_Profile.csv – This file contains Patient profile details like PatientID, OnlineFollower, Social media details, Income, Education, Age, FirstInteractionDate, CityType and EmployerCategory
Patient_ID | Online_Follower | LinkedIn_Shared | Twitter_Shared | Facebook_Shared | Income | Education_Score | Age | First_Interaction | City_Type | Employer_Category | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 516956 | 0 | 0 | 0 | 0 | 1 | 90 | 39 | 18-Jun-03 | Software Industry | |
1 | 507733 | 0 | 0 | 0 | 0 | 1 | None | 40 | 20-Jul-03 | H | Software Industry |
2 | 508307 | 0 | 0 | 0 | 0 | 3 | 87 | 46 | 02-Nov-02 | D | BFSI |
3 | 512612 | 0 | 0 | 0 | 0 | 1 | 75 | 47 | 02-Nov-02 | D | Education |
4 | 521075 | 0 | 0 | 0 | 0 | 3 | None | 80 | 24-Nov-02 | H | Others |
First_Health_Camp_Attended.csv & Second_Health_Camp_Attended.csv – These files contain details about people who attended health camps of the first and second formats. This includes the Donation (amount) & Health_Score of the person.
Patient_ID | Health_Camp_ID | Donation | Health_Score | |
---|---|---|---|---|
0 | 506181 | 6560 | 40 | 0.43902439 |
1 | 494977 | 6560 | 20 | 0.097560976 |
2 | 518680 | 6560 | 10 | 0.048780488 |
3 | 509916 | 6560 | 30 | 0.634146341 |
4 | 488006 | 6560 | 20 | 0.024390244 |
Third_Health_Camp_Attended.csv - This file contains details about people who attended health camps of the third format. This includes Number_of_stall_visited & Last_Stall_Visited_Number.
Patient_ID | Health_Camp_ID | Number_of_stall_visited | Last_Stall_Visited_Number |
---|---|---|---|
517875 | 6527 | 3 | 1 |
504692 | 6578 | 1 | 1 |
504692 | 6527 | 3 | 1 |
493167 | 6527 | 4 | 4 |
510954 | 6528 | 2 | 2 |
Classes among potential health camp attendees were imbalanced: each geographic location, and each camp within a location, had a different attendance rate. Thus, simply estimating attendance from a global or local history would lead to poor results. Additionally, some patients attended more than one MedCamp health event.
37,633 | Unique Patient IDs |
---|---|
65 | Unique Health Camps |
20,534 | Count of Patients Attending a Health Camp |
15,011 | Unique Patients Attending at least one Health Camp |
102,000 | Patient-Event Registrations |
~ 20% | Global Attendance Rate |
3 | Classes or Types of Health Camps |
According to the description on Kaggle, MedCamp wanted to know the probability that a patient would successfully attend a health-fair event. For the first two camp types, success was defined as receiving a health score; for the third type, success was visiting at least one booth. The MedCamp data spanned several years, and preliminary EDA showed that a patient could attend more than one camp. Thus, to correctly create a target feature I needed the Camp ID, the Patient ID, and whether the patient successfully attended that event.
Given that each patient could attend more than one event, it was necessary to create a primary key for each patient & health camp combination by concatenating the Patient and Camp IDs.
Health Camp ID | Patient ID | Primary Key
---|---|---
6578 | 489652 | 4896526578
Creating this primary key was helpful in combining information and creating additional time features; meaningful data was spread among several csv files.
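The primary-key construction can be sketched in a few lines of pandas, assuming the column names shown in the CSV excerpts above (the toy frame is illustrative):

```python
import pandas as pd

# Toy registration frame using the column names from the CSVs above.
reg = pd.DataFrame({"Patient_ID": [489652, 507246],
                    "Health_Camp_ID": [6578, 6578]})

# Concatenate Patient_ID and Health_Camp_ID into one key per registration.
# e.g. 489652 + 6578 -> 4896526578
reg["primary_key"] = (reg["Patient_ID"].astype(str)
                      + reg["Health_Camp_ID"].astype(str)).astype(int)
```

Because the key is unique per patient-camp pair, it can serve as the join column when merging the attendance, profile, and camp-detail files.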
Training the model with only the five anonymized features results in very poor performance.
The two anonymized features with the highest feature weights were Var1 and Var5. Interestingly, however, most patients had a zero value for both. Without knowing what Var1 represents, and given that only a few thousand patients had non-zero values, I decided not to drop or edit these features for modeling purposes. There is simply not enough context to apply domain knowledge to the features Var1 - Var5.
Thus, feature engineering was instrumental in improving the model.
The categorical features include City & Job Type. The binary categorical features indicate whether a patient shared their health-fair attendance online through Twitter, LinkedIn, or Facebook, or was an online follower of MedCamp.
Most patients were missing values for Job Type and other numerical features (discussed later). To avoid collinearity, I imputed 9999.0 for the missing values in the Job column.
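The sentinel imputation can be sketched as follows; the profile frame is hypothetical, with "Employer_Category" standing in for the Job column:

```python
import numpy as np
import pandas as pd

# Hypothetical profile frame; "Employer_Category" stands in for Job Type.
profile = pd.DataFrame(
    {"Employer_Category": ["Software Industry", np.nan, "BFSI", np.nan]})

# Impute the sentinel 9999.0 rather than the mode, so "missing" becomes
# its own level instead of inflating an existing category.
profile["Employer_Category"] = profile["Employer_Category"].fillna(9999.0)
```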
Var1 | Var2 | Var3 | Var4 | Var5 | y_target | Camp Start Date - Registration Date | Registration Date - First Interaction | Camp Start Date - First Interaction | Camp End Date - Registration Date | Camp Length | 1036 | 1216 | 1217 | 1352 | 1704 | 1729 | 2517 | 2662 | 23384 | Second | Third | B | C | D | E | F | G | 2100 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | 9.0 | 10.0 | 11.0 | 12.0 | 13.0 | 14.0 | 9999.0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 | -25.0 | 278 | 253 | 34 | 59 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -24.0 | 99 | 75 | 161 | 185 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
4.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | -60.0 | 355 | 295 | 711 | 771 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
Nearly 23,500 patients were missing the camp location. However, I was able to use the primary key to link a patient with a camp. Then, using sets, I confirmed that each camp ID is associated with exactly one city value by checking unions and intersections among Camp IDs, Patient IDs, and camp lengths spread across the csv files. I was thus able to backtrack and impute missing city values for each patient, which improved prediction scores.
The numerical features provided by MedCamp were missing for most patients. For example, Age, Income, and Education Score each had fewer than 2,000 values. Given that 94% of patients were missing all three values, imputing averages onto the other 35,000 patients for any numerical feature would be meaningless and would create collinearity.
I used the primary key to track unique patient events and consolidate the important information into a single csv that could be used for training and testing.
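The set-based backfill can be sketched with toy registrations; the real check ran unions and intersections across the csv files:

```python
import numpy as np
import pandas as pd

# Toy registrations where City_Type is sometimes missing.
reg = pd.DataFrame({"Health_Camp_ID": [6578, 6578, 6527, 6527],
                    "City_Type":      ["D", np.nan, "H", np.nan]})

# Verify each camp maps to exactly one city...
camp_city = (reg.dropna(subset=["City_Type"])
                .groupby("Health_Camp_ID")["City_Type"].agg(set))
assert all(len(cities) == 1 for cities in camp_city)

# ...then impute missing cities from the camp's known city.
lookup = camp_city.apply(lambda s: next(iter(s)))
reg["City_Type"] = reg["City_Type"].fillna(reg["Health_Camp_ID"].map(lookup))
```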
The following features were created:
Feature Name (Days) | |
---|---|
Registration Date - First Interaction | |
Camp Start Date - Registration Date | |
Camp Start Date - First Interaction | |
Camp End Date - Registration Date | |
Camp Length |
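The date features above are simple day differences; in this sketch the dates are chosen so the outputs match the first engineered row shown earlier (-25, 278, 253, 34, 59):

```python
import pandas as pd

# Hypothetical single-row frame with parsed datetime columns.
df = pd.DataFrame({
    "Registration_Date": pd.to_datetime(["2005-09-10"]),
    "First_Interaction": pd.to_datetime(["2004-12-06"]),
    "Camp_Start_Date":   pd.to_datetime(["2005-08-16"]),
    "Camp_End_Date":     pd.to_datetime(["2005-10-14"]),
})

# Each feature is a signed difference in whole days.
df["Camp Start Date - Registration Date"] = (
    df["Camp_Start_Date"] - df["Registration_Date"]).dt.days
df["Registration Date - First Interaction"] = (
    df["Registration_Date"] - df["First_Interaction"]).dt.days
df["Camp Start Date - First Interaction"] = (
    df["Camp_Start_Date"] - df["First_Interaction"]).dt.days
df["Camp End Date - Registration Date"] = (
    df["Camp_End_Date"] - df["Registration_Date"]).dt.days
df["Camp Length"] = (df["Camp_End_Date"] - df["Camp_Start_Date"]).dt.days
```

Note that "Camp Start Date - Registration Date" can be negative when a patient registers after the camp has begun.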
Given that the goal is to ensure every patient has an individualized health experience, specific supplies must be on hand. Accurate predictions mean we can be confident in stocking the correct supplies and accomplishing the goal of improving health through individualized interventions.
As shown above, all models achieved a similar ROC score. However, when the numbers of false negatives and false positives are taken into consideration, the XGBoost model is the best choice.
The date features improved scores for all models. Additionally, for all but some iterations of Random Forests, the date features ranked among the top feature importances.
The global attendance rate was 20%, and the training and validation attendance rate was 27%. However, 5 of 10 camp locations had an attendance rate between 32.2% and 33.8%, and the highest attendance rate was just over 70%. This high variance helps explain why adjusting to the exact global attendance rate, when dealing with class imbalance, caused the models to perform worse than the standard balanced-class option. The models performed best with a slight class weighting of 0.4 for attends and 0.6 for non-attends.
I created a new dataframe containing the prediction and probability results for three of the models used in this project. y_target_SUM is the total 'score': the sum of predicted attendance (0 or 1) across the models plus y_target (top value = 4). Y_count_allModels is the sum of predicted attendance values (0 or 1) among the three models analyzed here (top value = 3).
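When ROC scores are similar, the tie can be broken by comparing raw false-positive and false-negative counts. A minimal sketch with scikit-learn, using toy labels and predictions (the model names are illustrative):

```python
from sklearn.metrics import confusion_matrix

# Toy ground truth and two models' predictions with similar ROC scores.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
pred_a = [0, 1, 1, 0, 0, 1, 0, 1]   # e.g. XGBoost
pred_b = [1, 0, 1, 1, 1, 0, 0, 1]   # e.g. Random Forest

for name, pred in [("model_a", pred_a), ("model_b", pred_b)]:
    # ravel() flattens the 2x2 matrix into (tn, fp, fn, tp).
    tn, fp, fn, tp = confusion_matrix(y_true, pred).ravel()
    print(f"{name}: FP={fp} FN={fn}")
```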
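The slight class weighting can be expressed directly in scikit-learn; this is a sketch with random stand-in features, not the project's actual training code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random stand-in features; ~27% positive rate, as in training/validation.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (rng.random(200) < 0.27).astype(int)

# Manual weights (0.4 attends / 0.6 non-attends) instead of
# class_weight="balanced".
model = LogisticRegression(class_weight={1: 0.4, 0: 0.6})
model.fit(X, y)
```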
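The two agreement columns can be sketched as row-wise sums over the model prediction columns described above; the values here are illustrative:

```python
import pandas as pd

# Toy results frame using the prediction column names from the project.
results = pd.DataFrame({
    "y_target":       [0, 1, 1, 0],
    "prediction_kNN": [0, 1, 0, 1],
    "prediction_sVC": [0, 1, 1, 1],
    "prediction_xg":  [0, 0, 1, 1],
})
model_cols = ["prediction_kNN", "prediction_sVC", "prediction_xg"]

# Agreement among the three models (top value = 3)...
results["Y_count_allModels"] = results[model_cols].sum(axis=1)
# ...and model votes plus the true label (top value = 4).
results["y_target_SUM"] = results["Y_count_allModels"] + results["y_target"]
```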
Unnamed: 0 | Var1 | Var2 | Var3 | Var4 | Var5 | Camp Start Date - Registration Date | Registration Date - First Interaction | Camp Start Date - First Interaction | Camp End Date - Registration Date | Camp Length | Second | Third | A | C | D | E | F | G | 2100 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | 9.0 | 10.0 | 11.0 | 12.0 | 13.0 | 14.0 | 9999.0 | 1 | 2 | 3 | 4 | 1036 | 1216 | 1217 | 1352 | 1704 | 1729 | 2517 | 2662 | 23384 | Patient_ID | prediction | Proba | y_target | proba_kNN | prediction_kNN | proba_sVC | prediction_sVC | proba_xg | prediction_xg | Y_count_allModels | Y_target_SUM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -119.0 | 14 | -105 | 66 | 185 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 514789 | 0.0 | 0.2775604 | 0.0 | 0.0 | 0.0 | 0.1561240540625182 | 0.0 | 0.28253844 | 0.0 | 0.0 | 0.0 |
1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -410.0 | 559 | 149 | 361 | 771 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 508149 | 0.0 | 0.24640098 | 0.0 | 0.1 | 0.0 | 0.1557039638243477 | 0.0 | 0.23559786 | 0.0 | 0.0 | 0.0 |
2 | 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -76.0 | 262 | 186 | 113 | 189 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 492650 | 0.0 | 0.33918345 | 0.0 | 0.2 | 0.0 | 0.1573992543873874 | 0.0 | 0.34879157 | 0.0 | 0.0 | 0.0 |
3 | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 53.0 | 107 | 160 | 57 | 4 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 511274 | 0.0 | 0.4557019 | 0.0 | 0.3 | 0.0 | 0.1542467704410321 | 0.0 | 0.43797377 | 0.0 | 0.0 | 0.0 |
4 | 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 19.0 | 11 | 30 | 58 | 39 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 520795 | 1.0 | 0.5163712 | 0.0 | 0.4 | 0.0 | 0.5654859255622244 | 1.0 | 0.4760046 | 0.0 | 1.0 | 1.0 |
Upon closer examination, there is disagreement among the models about which patients will attend a health event. It might be possible to gain useful insight by examining interesting patients: those on which the models agreed or disagreed, false positives, false negatives, etc.
Below are plots showing the following for each Patient:
- A. The probability each model assigned to a patient (y-axis)
- B. If the patient actually attended (shown by color)
- C. The overall score group for that patient (shown by the respective column the dot appears in)
Since this data is not descriptive, black-box models are acceptable here. Optimizing a neural network may produce good results: using TensorFlow and Keras, I achieved results similar to the other models with minimal training.
However, I am confident that these scores can improve by using a grid search and other optimization techniques.
*Under construction
*While AB testing is not the most direct method to compare models, IT IS a valuable learning experience for implementing production code.
My plan is to conduct a mock analysis of 'model predictions' had they been actually implemented. Essentially, 'What if' MedCamp used previous camp data to train models for each camp individually and sequentially?
Steps in Experiment:
- Put camps in order by end date.
- Remove overlap: if Camps A, B & C all end before Camp D starts, their patient data can be used to train the bandits [SVC, Logistic Regression, KNN] to predict Camp D's patient attendance. However, if Camp C starts before Camp D but does not end before Camp D starts, Camp C's results can't be used to train the bandits.
- Append model results to the data frame.
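The ordering and overlap-removal steps can be sketched as follows, using the camp dates from Health_Camp_Detail.csv shown earlier:

```python
import pandas as pd

# Camps ordered by end date (dates from the Health_Camp_Detail excerpt).
camps = pd.DataFrame({
    "Health_Camp_ID":  [6560, 6530, 6544],
    "Camp_Start_Date": pd.to_datetime(["2003-08-16", "2003-08-16", "2003-11-03"]),
    "Camp_End_Date":   pd.to_datetime(["2003-08-20", "2003-10-28", "2003-11-15"]),
}).sort_values("Camp_End_Date").reset_index(drop=True)

def training_camps(target_id):
    """Camps whose data can train the bandits for the target camp:
    those ending before the target starts (overlapping camps excluded)."""
    start = camps.loc[camps["Health_Camp_ID"] == target_id,
                      "Camp_Start_Date"].iloc[0]
    return camps.loc[camps["Camp_End_Date"] < start, "Health_Camp_ID"].tolist()
```

For example, camp 6544 (starting 03-Nov-03) can be trained on camps 6560 and 6530, both of which end before it starts.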
Steps for modified Thompson Sampling:
I modified traditional Thompson Sampling in favor of a numerical solution. Rather than explicitly using the Beta draw, a small penalty (-0.5%) was imposed if a bandit was chosen AND its sampled Beta was greater than the actual win rate for that model. In a different experiment using different bandits, this method achieved a modest improvement over using the exact Beta. I plan to run the same experiment on this data set a few more times; results coming soon.
Below is a graph showing which Beta was chosen and how it compares to that bandit's current win rate.
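A sketch of one reading of this modified Thompson Sampling: draw a Beta sample per bandit, then dock 0.5% from any sample that exceeds that bandit's observed win rate before selecting the maximum. Bandit names and win/loss counts are illustrative:

```python
import random

random.seed(42)
wins   = {"SVC": 5, "LogReg": 3, "KNN": 4}
losses = {"SVC": 3, "LogReg": 5, "KNN": 4}

def choose_bandit():
    scores = {}
    for b in wins:
        sample = random.betavariate(wins[b] + 1, losses[b] + 1)
        rate = wins[b] / (wins[b] + losses[b])
        if sample > rate:          # penalize over-optimistic draws
            sample -= 0.005
        scores[b] = sample
    return max(scores, key=scores.get)

def update(bandit, won):
    """Record whether the chosen bandit correctly predicted attendance."""
    if won:
        wins[bandit] += 1
    else:
        losses[bandit] += 1
```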
Win = Correct prediction of a random patient's attendance for that camp.
Initial results are mixed, with some camps having improved prediction rates and some being worse. Three rounds of results are shown in the table below.
When I separated each camp and had the models predict patient attendance for just that camp, each model generally performed better than when trained on all camps at once. My next step is to see how scores align with other features (camp location, camp length, etc.). As indicated in the post-hoc analysis above, there was much variation among camp attendance rates, which may explain the poorer pooled performance: the models may have been 'pulled' away from a better prediction vector by too much diversity, with too little data within each slice to form a normal distribution.
camp_ID | Win Rate SVC | Win Rate KNN | Win Rate Logistic Regression | Camp Size (Number of Patients) | knn2 | svc2 | log2 | knn3 | svc3 | log3 | knn4 | svc4 | log4 | knn5R | svc5R | log5R | knn6R | svc6R | log6R | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6578 | 0.345 | 0.336 | 0.362 | 2835 | 0.361 | 0.348 | 0.349 | 0.366 | 0.335 | 0.338 | 0.362 | 0.335 | 0.345 | 0.359 | 0.309 | 0.35 | 0.357 | 0.318 | 0.361 |
1 | 6532 | 0.874 | 0.861 | 0.863 | 1991 | 0.5 | 0.5 | 0.667 | 0.5 | 0.6 | 0.5 | 0.843 | 0.861 | 0.886 | 0.871 | 0.866 | 0.862 | 0.849 | 0.875 | 0.873 |
2 | 6543 | 0.88 | 0.877 | 0.852 | 6541 | 0.883 | 0.865 | 0.857 | 0.885 | 0.864 | 0.856 | 0.878 | 0.881 | 0.837 | 0.868 | 0.868 | 0.884 | 0.874 | 0.879 | 0.862 |
3 | 6580 | 0.935 | 0.925 | 0.913 | 3515 | 0.921 | 0.934 | 0.925 | 0.92 | 0.934 | 0.919 | 0.926 | 0.924 | 0.929 | 0.93 | 0.909 | 0.932 | 0.924 | 0.933 | 0.919 |
4 | 6570 | 0.92 | 0.918 | 0.927 | 3562 | 0.913 | 0.925 | 0.926 | 0.932 | 0.915 | 0.908 | 0.918 | 0.921 | 0.926 | 0.923 | 0.922 | 0.918 | 0.92 | 0.924 | 0.92 |
5 | 6542 | 0.836 | 0.81 | 0.866 | 2366 | 0.85 | 0.828 | 0.857 | 0.845 | 0.842 | 0.853 | 0.858 | 0.845 | 0.805 | 0.855 | 0.825 | 0.843 | 0.842 | 0.851 | 0.843 |
6 | 6571 | 0.92 | 0.892 | 0.899 | 2084 | 0.911 | 0.897 | 0.908 | 0.902 | 0.906 | 0.908 | 0.906 | 0.903 | 0.907 | 0.913 | 0.886 | 0.911 | 0.911 | 0.904 | 0.898 |
7 | 6527 | 0.379 | 0.344 | 0.428 | 4142 | 0.413 | 0.41 | 0.384 | 0.406 | 0.32 | 0.43 | 0.295 | 0.42 | 0.3 | 0.413 | 0.321 | 0.412 | 0.395 | 0.407 | 0.413 |
8 | 6526 | 0.954 | 0.969 | 0.959 | 3807 | 0.97 | 0.966 | 0.944 | 0.965 | 0.965 | 0.958 | 0.966 | 0.965 | 0.953 | 0.968 | 0.953 | 0.966 | 0.957 | 0.966 | 0.967 |
9 | 6539 | 0.883 | 0.861 | 0.871 | 1990 | 0.333 | 0.75 | 0.5 | 0.5 | 0.5 | 0.667 | 0.846 | 0.884 | 0.886 | 0.885 | 0.868 | 0.846 | 0.879 | 0.87 | 0.858 |
10 | 6528 | 0.306 | 0.305 | 0.243 | 1742 | 0.5 | 0.667 | 0.5 | 0.5 | 0.667 | 0.5 | 0.313 | 0.283 | 0.271 | 0.294 | 0.28 | 0.315 | 0.293 | 0.3 | 0.305 |
11 | 6555 | 0.646 | 0.61 | 0.489 | 1736 | 0.5 | 0.333 | 0.5 | 0.5 | 0.333 | 0.5 | 0.6 | 0.602 | 0.634 | 0.612 | 0.633 | 0.569 | 0.602 | 0.625 | 0.615 |
12 | 6541 | 0.409 | 0.382 | 0.429 | 1545 | 0.5 | 0.333 | 0.5 | 0.5 | 0.333 | 0.5 | 0.401 | 0.373 | 0.432 | 0.406 | 0.416 | 0.412 | 0.395 | 0.425 | 0.388 |
13 | 6523 | 0.277 | 0.258 | 0.315 | 2082 | 0.303 | 0.3 | 0.29 | 0.29 | 0.309 | 0.264 | 0.301 | 0.306 | 0.25 | 0.284 | 0.303 | 0.312 | 0.297 | 0.288 | 0.308 |
14 | 6538 | 0.841 | 0.837 | 0.845 | 3952 | 0.839 | 0.843 | 0.843 | 0.839 | 0.845 | 0.84 | 0.848 | 0.824 | 0.842 | 0.841 | 0.832 | 0.849 | 0.844 | 0.841 | 0.839 |
15 | 6549 | 0.682 | 0.719 | 0.687 | 1833 | 0.75 | 0.667 | 0.5 | 0.5 | 0.667 | 0.75 | 0.718 | 0.679 | 0.682 | 0.688 | 0.721 | 0.69 | 0.697 | 0.699 | 0.703 |
16 | 6586 | 0.766 | 0.789 | 0.675 | 2622 | 0.789 | 0.723 | 0.769 | 0.798 | 0.745 | 0.681 | 0.782 | 0.687 | 0.796 | 0.766 | 0.78 | 0.765 | 0.775 | 0.781 | 0.74 |
17 | 6554 | 0.843 | 0.859 | 0.846 | 2301 | 0.844 | 0.857 | 0.848 | 0.839 | 0.863 | 0.835 | 0.844 | 0.845 | 0.862 | 0.843 | 0.858 | 0.85 | 0.856 | 0.842 | 0.853 |
18 | 6529 | 0.495 | 0.573 | 0.577 | 3821 | 0.581 | 0.556 | 0.51 | 0.564 | 0.566 | 0.574 | 0.565 | 0.575 | 0.566 | 0.574 | 0.566 | 0.551 | 0.572 | 0.567 | 0.56 |
19 | 6540 | 0.888 | 0.901 | 0.906 | 1424 | 0.5 | 0.5 | 0.4 | 0.667 | 0.333 | 0.333 | 0.893 | 0.897 | 0.905 | 0.912 | 0.885 | 0.888 | 0.888 | 0.881 | 0.917 |
20 | 6534 | 0.305 | 0.296 | 0.283 | 3595 | 0.303 | 0.287 | 0.285 | 0.28 | 0.303 | 0.306 | 0.297 | 0.291 | 0.297 | 0.286 | 0.318 | 0.287 | 0.288 | 0.3 | 0.296 |
21 | 6535 | 0.871 | 0.831 | 0.884 | 1880 | 0.5 | 0.75 | 0.667 | 0.8 | 0.5 | 0.5 | 0.843 | 0.861 | 0.888 | 0.852 | 0.887 | 0.842 | 0.847 | 0.859 | 0.884 |
22 | 6561 | 0.542 | 0.685 | 0.592 | 198 | 0.5 | 0.5 | 0.333 | 0.5 | 0.5 | 0.4 | 0.385 | 0.62 | 0.682 | 0.648 | 0.645 | 0.586 | 0.595 | 0.694 | 0.592 |
23 | 6585 | 0.735 | 0.787 | 0.752 | 1396 | 0.5 | 0.6 | 0.5 | 0.5 | 0.5 | 0.667 | 0.797 | 0.731 | 0.706 | 0.755 | 0.777 | 0.759 | 0.71 | 0.767 | 0.786 |
24 | 6536 | 0.565 | 0.57 | 0.495 | 2035 | 0.573 | 0.493 | 0.557 | 0.554 | 0.549 | 0.564 | 0.552 | 0.536 | 0.572 | 0.557 | 0.561 | 0.547 | 0.555 | 0.542 | 0.57 |
25 | 6562 | 0.963 | 0.959 | 0.957 | 2336 | 0.956 | 0.957 | 0.968 | 0.968 | 0.959 | 0.946 | 0.96 | 0.951 | 0.969 | 0.961 | 0.962 | 0.955 | 0.964 | 0.952 | 0.963 |
26 | 6537 | 0.881 | 0.864 | 0.877 | 3857 | 0.878 | 0.876 | 0.869 | 0.883 | 0.838 | 0.883 | 0.888 | 0.84 | 0.859 | 0.871 | 0.874 | 0.879 | 0.872 | 0.877 | 0.874 |
27 | 6581 | 0.943 | 0.935 | 0.938 | 1483 | 0.667 | 0.5 | 0.75 | 0.667 | 0.5 | 0.75 | 0.939 | 0.943 | 0.933 | 0.941 | 0.936 | 0.938 | 0.943 | 0.94 | 0.935 |
28 | 6524 | 0.655 | 0.538 | 0.667 | 147 | 0.667 | 0.667 | 0.333 | 0.5 | 0.667 | 0.5 | 0.633 | 0.658 | 0.538 | 0.656 | 0.655 | 0.531 | 0.516 | 0.605 | 0.682 |
29 | 6587 | 0.462 | 0.581 | 0.556 | 77 | 0.531 | 0.536 | 0.542 | 0.308 | 0.6 | 0.556 | 0.585 | 0.429 | 0.545 | 0.531 | 0.52 | 0.556 | 0.632 | 0.481 | 0.474 |
30 | 6557 | 0.35 | 0.3 | 0.294 | 50 | 0.375 | 0.227 | 0.368 | 0.125 | 0.316 | 0.367 | 0.3 | 0.28 | 0.364 | 0.267 | 0.364 | 0.3 | 0.375 | 0.348 | 0.278 |
31 | 6546 | 0.931 | 0.967 | 0.936 | 401 | 0.957 | 0.904 | 0.966 | 0.951 | 0.946 | 0.949 | 0.939 | 0.972 | 0.919 | 0.937 | 0.967 | 0.939 | 0.937 | 0.94 | 0.966 |
32 | 6569 | 0.374 | 0.357 | 0.296 | 175 | 0.5 | 0.2 | 0.5 | 0.333 | 0.5 | 0.25 | 0.268 | 0.403 | 0.273 | 0.1 | 0.38 | 0.375 | 0.354 | 0.355 | 0.375 |
33 | 6564 | 0.915 | 0.884 | 0.885 | 512 | 0.88 | 0.912 | 0.901 | 0.893 | 0.911 | 0.886 | 0.915 | 0.871 | 0.903 | 0.904 | 0.871 | 0.917 | 0.898 | 0.87 | 0.918 |
34 | 6575 | 0.36 | 0.32 | 0.533 | 88 | 0.467 | 0.387 | 0.441 | 0.413 | 0.333 | 0.475 | 0.321 | 0.5 | 0.459 | 0.45 | 0.447 | 0.25 | 0.447 | 0.333 | 0.481 |
35 | 6552 | 0.125 | 0.514 | 0.548 | 80 | 0.429 | 0.529 | 0.5 | 0.381 | 0.538 | 0.519 | 0.412 | 0.488 | 0.556 | 0.538 | 0.529 | 0.286 | 0.462 | 0.467 | 0.543 |
36 | 6558 | 0.583 | 0.526 | 0.5 | 42 | 0.72 | 0.462 | 0.273 | 0.588 | 0.455 | 0.571 | 0.412 | 0.643 | 0.611 | 0.556 | 0.529 | 0.571 | 0.652 | 0.5 | 0.429 |
I edited the AB testing script to randomize the ordering of patients. The fear was that, by chance, a bandit could stumble upon a series of easy-to-predict patients (ones all the models classify correctly) and thereby skew how accurate the bandit/model appears to be at predicting patient attendance.
I performed ANOVA comparisons between the randomized and non-randomized orderings of patients for AB testing. All p-values were between .22 and .98.
There are no significant differences in win rates between models where patients were randomly chosen and models using the arbitrary order from the camp output.
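One such comparison can be sketched with SciPy's one-way ANOVA; the win rates here are illustrative values in the range of camp 6578's rows in the table above:

```python
from scipy.stats import f_oneway

# Illustrative win rates for one camp under the two orderings.
ordered_runs    = [0.345, 0.336, 0.362, 0.361, 0.348]
randomized_runs = [0.359, 0.309, 0.350, 0.357, 0.318]

# One-way ANOVA: a large p-value means we cannot reject the null
# hypothesis that the two orderings produce the same win rates.
stat, p_value = f_oneway(ordered_runs, randomized_runs)
```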
What are the Most unique Patients?
Due to the binary nature of classification, we need to break 'unique' into two groups:
- Those patients who attend and No models predict they will attend
- Those patients who do NOT attend but All models predict they will attend
There are 245 patients among the roughly 73,000 tested that would be considered the most unique.
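The "most unique" mask can be sketched from the three model prediction columns described earlier; the rows here are toy examples:

```python
import pandas as pd

# Toy frame with the true label and the three models' predictions.
df = pd.DataFrame({
    "y_target":       [1, 0, 1, 0],
    "prediction_kNN": [0, 1, 1, 0],
    "prediction_sVC": [0, 1, 0, 0],
    "prediction_xg":  [0, 1, 1, 1],
})
votes = df[["prediction_kNN", "prediction_sVC", "prediction_xg"]].sum(axis=1)

# Attended with zero votes, or did not attend with unanimous votes.
most_unique = ((df["y_target"] == 1) & (votes == 0)) | \
              ((df["y_target"] == 0) & (votes == 3))
```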
What are the next most unique Patients?
Due to the binary nature of classification, we need to break 'unique' into two groups:
- Those patients who attend and 1/3 model predicts they will attend
- Those patients who do NOT attend and 2/3 models predict they will attend
Questions to ponder:
- How similar was my approach to the recent Google research?
- Is there conflicting data within and among camp locations? As in: patient A has values 1,1,1 for characteristics A, B, and C and does NOT attend, while patient B has 1,1,1 for the same characteristics and DOES attend. Thinking of the characteristics as binary, there would be 2^N possible combinations for N characteristics.
- Clustering patients might reveal trends and predictors.
- I would like to see if multi-class models can predict how many models successfully predicted a patient's attendance. As in, if 4 models were successful (success = 4), is there a meta-pattern that could be learned?