math-a3k/covid-ht

Classifier

This issue is for discussing the Classifier.

Currently, the Classifier is handled at a django-ai level, not at covid-ht. Any discussion here should end in a django-ai action (change, implementation, etc.)

The requirements for the implementation of the Classifier are:

  • Be able to handle categorical data
  • Be able to handle numerical data
  • Be able to deal with missing values (NaNs)

scikit-learn's Histogram-based Gradient Boosting Classification Tree (HistGradientBoostingClassifier) is currently being used.

On the simulated data included, it achieves approximately 90% accuracy (estimated by 10-fold CV) with only 5 informative variables (RBC, WBC, PLT, NEUT, LYMPH), while the rest are noisy / non-informative about the class.

Any discussion for improvements about this (or another) classifier should go here.

Both Support Vector Machines and Neural Networks, in their vanilla versions, handle only numeric data (e.g. 'rbc'), not categorical data (e.g. 'sex'). Although this can be overcome by encoding, those techniques have limitations and are not ideal.
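For reference, the encoding workaround mentioned above could look like the following sketch, which one-hot encodes 'sex' before an SVM. The values are made up for illustration, and one-hot encoding is just one common choice, not something django-ai is known to use.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Made-up illustrative rows; not real hemogram data.
df = pd.DataFrame({
    "rbc": [4.1, 5.2, 4.8, 3.9, 5.0, 4.4],
    "sex": ["F", "M", "M", "F", "M", "F"],
})
y = [0, 1, 1, 0, 1, 0]

# Scale the numeric column and expand the categorical one into 0/1 columns.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["rbc"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])

model = make_pipeline(preprocess, SVC())
model.fit(df, y)

# 'sex' becomes two columns, so the SVM sees 3 numeric features.
encoded = model[0].transform(df)
print(encoded.shape)
```

Each category adds a column, so variables with many levels inflate the feature space, and missing values still have to be handled separately.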

Categorical data is particularly important for this problem, because results vary by sex, age group, etc. Not taking such variables into account would likely lead to a "bad" (inaccurate) classifier.

Classification Trees (CTs) have "built-in support" for both categorical data and missing data (though this may vary by implementation), which makes them the first choice for the problem. Logistic regression, by contrast, can't handle missing data: it has to be imputed.
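To make the logistic regression point concrete, this small sketch shows scikit-learn's LogisticRegression rejecting NaNs and then working once an imputation step is added. The numbers are made up, and median imputation is only an assumed choice for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up values with two missing entries.
X = np.array([
    [4.1, 7.0],
    [5.2, np.nan],
    [4.8, 6.1],
    [np.nan, 9.3],
    [5.0, 5.5],
    [3.9, 8.8],
])
y = np.array([0, 1, 1, 0, 1, 0])

# Fitting directly on data containing NaN is rejected.
try:
    LogisticRegression().fit(X, y)
    raised = False
except ValueError:
    raised = True
print("plain LogisticRegression rejected NaN:", raised)

# With an imputation step in front, the same data can be fitted.
model = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
model.fit(X, y)
predictions = model.predict(X)
```

The imputed values are guesses, so the model is fitted partly on data that was never measured, which is the trade-off a classifier with native missing-value support avoids.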

CTs also have the advantage of easy interpretation, but this is traded for better accuracy when using Boosting. Although a CT can be 'graphed', it would take more time for a person to 'follow the diagram' than to enter the values and let the machine do the classification. Further understanding of the data should be done 'outside' covid-ht, via the CSV data download; the goal of covid-ht is to do the best job at classification.

Being able to handle missing values is also very important, especially for combining data from different sources, where different blood tests may have been performed. The success of the classifier depends on the amount and quality of the data. Getting quality data may not be that easy: for example, having a specific COVID-19 test done at the same time that the blood is sampled, and sharing the result, may require patient consent (although hemogram data is easily anonymized, it is still the patient's data and thus requires consideration).
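A small sketch of why combining sources produces missing values: when two hypothetical labs report different panels, stacking their records leaves NaN wherever a test was not performed. The lab names, columns and values below are invented for illustration.

```python
import pandas as pd

# Hypothetical lab A reports platelets but no differential count.
lab_a = pd.DataFrame({"rbc": [4.5, 5.1], "wbc": [6.2, 7.9], "plt": [250, 310]})
# Hypothetical lab B reports neutrophils and lymphocytes but no platelets.
lab_b = pd.DataFrame({"rbc": [4.8], "wbc": [5.4], "neut": [3.1], "lymph": [1.7]})

# The default outer join keeps every column and fills the gaps with NaN.
combined = pd.concat([lab_a, lab_b], ignore_index=True)
print(combined)
print("missing cells:", int(combined.isna().sum().sum()))
```

A classifier that accepts NaN can be trained on the combined table directly, instead of dropping the incomplete rows or columns.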