Eliminate the `matrix_type` and `is_test` concepts
ecsalomon opened this issue · 0 comments
ecsalomon commented
Pulling this out into an issue. Context from my earlier comment on the retrain-and-predict PR: We decided on using `matrix_type` in the first version of triage to handle different label behavior in train and test in the inspections use case. It never made much conceptual sense, and introducing new code paths has re-exposed the weaknesses of that choice and is committing us to further weirdness.
I don't know if you all discussed this on Tuesday, but here are a couple of possible solutions for Triage v5.0.0. The first is probably easier to implement, but I think it is the weaker solution:
- Eliminate both the `matrix_type` and `is_test` properties of matrices
- Include missing labels as missing in all matrices
- Replace `include_missing_labels_in_train_as` with two keys: `impute_missing_labels_at_train` and `impute_missing_labels_at_test`. These can be `True` or `False`. If `True`, you must also specify the value. We already have lots of duplicated keys for train/test behavior, so this fits an existing configuration pattern. These can be set to defaults when there are inspection and EWS entry points
- The Architect writes matrices without label imputation, allowing them to be reused for any purpose
- Catwalk does the label imputation on the fly when it is training or testing and always writes the labels to the database. This has high storage costs, but the reason we initially had to decide on label imputation at matrix-building time was that we had not yet conceived of the `train_results` schema. Now that Triage can write train predictions, it can write train labels to the database, and the `matrix_type` concept is no longer needed. We could reduce the storage cost a bit by storing the labels separately from the predictions in the train and test results schemas (the model is something like `(matrix_uuid, labels_imputed, imputation_method, entity_id, as_of_date, label)`) and joining predictions to labels through `matrix_uuid` and imputation behavior, but that join is likely complex (it has to go through the experiments table?). We could squeeze one extra bit of metadata out by adding a flag for whether a label was imputed, but I don't have any ideas for where we would use that information, and it can always be reconstructed from the matrix itself.
- The ModelEvaluator takes only a single set of metrics, and Catwalk creates a separate ModelEvaluator object for training and testing.
- Separate matrix storage, rather than separate matrix types, is introduced for Production and Development.
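To make the on-the-fly behavior concrete, here is a minimal Python sketch of the imputation Catwalk would apply at train or test time under this proposal. The function name `impute_labels` and its signature are hypothetical illustrations, not existing Triage API; the real implementation would operate on stored matrices and write labels to the database.

```python
def impute_labels(rows, impute, value=None):
    """Apply label imputation at train/test time (hypothetical sketch).

    rows: list of (entity_id, as_of_date, label) tuples, where label
          may be None for entities missing a label.
    impute: the impute_missing_labels_at_train / _at_test setting.
    value: the value to fill in when impute is True.
    """
    if impute and value is None:
        raise ValueError("impute=True requires an imputation value")
    if impute:
        # Fill missing labels with the configured value
        return [(e, d, value if label is None else label) for e, d, label in rows]
    # Otherwise drop unlabeled rows entirely
    return [(e, d, label) for e, d, label in rows if label is not None]
```

Because the matrices themselves are written without imputation, the same stored matrix can be reused for training, testing, or prediction, with this step deciding the label behavior per use.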
An alternative:
- Eliminate both the `matrix_type` and `is_test` properties of matrices
- Replace `include_missing_labels_in_train_as` with two keys: `impute_missing_labels_at_train` and `impute_missing_labels_at_test`. These can be `True` or `False`. If `True`, you must also specify the value. We already have lots of duplicated keys for train/test behavior, so this fits an existing configuration pattern. These can be set to defaults when there are inspection and EWS entry points
- Instead of storing labels in the train/test matrices, we begin storing the design matrices and the labels separately. Train labels and test labels have separate stores, and the imputation behavior is a key in the label metadata. This reduces (but does not eliminate, as the first proposal does) duplication of storage in the EWS case.
- We introduce new metadata tables for labels and experiment-label pairs, and add a `labels_id` column to many of the places where we also have a `matrix_id` column (models, predictions, evaluations, etc.). This increases db storage a little, but not as much as having to store all the train labels in the db.
- Include all entities in the cohort in all design matrices
- The `replace` pathway checks the imputation flag on the labels to determine whether the Architect needs to create new labels
- Design matrices are combined with the correct labels on reading, and rows are dropped if there is no label and the imputation key is `False`
- The ModelEvaluator takes only a single set of metrics, and Catwalk creates a separate ModelEvaluator object for training and testing.
- You can still skip writing train results to conserve db storage
- Production/predict-forward/retrain-and-predict uses design matrices but not labels
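The read-time combination step in this alternative could look something like the following Python sketch. The function `combine` and its dict-based inputs are hypothetical stand-ins for Triage's actual matrix and label stores, shown only to illustrate the join-and-drop behavior.

```python
def combine(design_rows, labels, impute, value=None):
    """Join a design matrix with separately stored labels at read time
    (hypothetical sketch).

    design_rows: dict keyed by (entity_id, as_of_date) -> feature dict,
                 covering all entities in the cohort.
    labels: dict keyed by (entity_id, as_of_date) -> label, from the
            separate label store.
    impute: the imputation key from the label metadata.
    value: fill value used when impute is True.
    """
    combined = {}
    for key, features in design_rows.items():
        label = labels.get(key)
        if label is None:
            if not impute:
                continue  # no label and no imputation: drop the row
            label = value  # impute the configured value
        combined[key] = (features, label)
    return combined
```

Since the design matrices carry no labels, the production/predict-forward path can read them directly and skip this step entirely.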
Originally posted by @ecsalomon in #631 (comment)