Eliminate the `matrix_type` and `is_test` concepts
ecsalomon opened this issue · 0 comments
ecsalomon commented
Pulling this out into an issue. Context from my earlier comment on the retrain-and-predict PR: We decided on using `matrix_type` in the first version of triage to handle different label behavior in train and test in the inspections use case. It never made much conceptual sense, and introducing new code paths has re-exposed the weaknesses of that choice and is committing us to further weirdness.
I don't know if you all discussed this on Tuesday, but here are a couple of possible solutions for Triage v5.0.0. The first is probably easier to implement, but I think it is the weaker solution:
- Eliminate both the `matrix_type` and `is_test` properties of matrices
- Include missing labels as missing in all matrices
- Replace `include_missing_labels_in_train_as` with two keys: `impute_missing_labels_at_train` and `impute_missing_labels_at_test`. These can be `True` or `False`. If `True`, you must also specify the value. We already have lots of duplicated keys for train/test behavior, so this fits an existing configuration pattern. These can be set to defaults when there are inspection and EWS entry points
- The Architect writes matrices without label imputation, allowing them to be reused for any purpose
- Catwalk does the label imputation on the fly when it is training or testing and always writes the labels to the database. This has high storage costs, but the reason we initially had to decide on label imputation at matrix-building time was that we had not yet conceived of the `train_results` schema. Now that Triage can write train predictions, it can write train labels to the database, and the `matrix_type` concept is no longer needed. We could reduce the storage cost a bit by storing the labels separately from the predictions in the train and test results schemas (the model is something like `(matrix_uuid, labels_imputed, imputation_method, entity_id, as_of_date, label)`) and joining predictions to labels through `matrix_uuid` and imputation behavior, but that join is likely complex (it has to go through the experiments table?). We could squeeze one extra bit of metadata out by adding a flag for whether a label was imputed, but I don't have any ideas for where we would use that information, and it can always be reconstructed from the matrix itself.
- The ModelEvaluator takes only a single set of metrics, and Catwalk creates a separate ModelEvaluator object for training and testing.
- Separate matrix storage, rather than separate matrix types, is introduced for Production and Development.
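To make the on-the-fly behavior concrete, here is a minimal Python sketch of the imputation Catwalk would apply at train or test time under this proposal. The function name `impute_labels` and its signature are hypothetical illustrations, not existing Triage API; the real implementation would operate on stored matrices and write labels to the database.

```python
def impute_labels(rows, impute, value=None):
    """Apply label imputation at train/test time (hypothetical sketch).

    rows: list of (entity_id, as_of_date, label) tuples, where label
          may be None for entities missing a label.
    impute: the impute_missing_labels_at_train / _at_test setting.
    value: the value to fill in when impute is True.
    """
    if impute and value is None:
        raise ValueError("impute=True requires an imputation value")
    if impute:
        # Fill missing labels with the configured value
        return [(e, d, value if label is None else label) for e, d, label in rows]
    # Otherwise drop unlabeled rows entirely
    return [(e, d, label) for e, d, label in rows if label is not None]
```

Because the matrices themselves are written without imputation, the same stored matrix can be reused for training, testing, or prediction, with this step deciding the label behavior per use.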
An alternative:
- Eliminate both the `matrix_type` and `is_test` properties of matrices
- Replace `include_missing_labels_in_train_as` with two keys: `impute_missing_labels_at_train` and `impute_missing_labels_at_test`. These can be `True` or `False`. If `True`, you must also specify the value. We already have lots of duplicated keys for train/test behavior, so this fits an existing configuration pattern. These can be set to defaults when there are inspection and EWS entry points
- Instead of storing labels in the train/test matrices, we begin storing the design matrices and the labels separately. Train labels and test labels have separate stores, and the imputation behavior is a key in the label metadata. This reduces (but does not eliminate, as the first proposal does) duplication of storage in the EWS case.
- We introduce new metadata tables for labels and experiment-label pairs, and add a `labels_id` column to many of the places where we also have a `matrix_id` column (models, predictions, evaluations, etc.). This increases db storage a little, but not as much as having to store all the train labels in the db.
- Include all entities in the cohort in all design matrices
- The `replace` pathway checks the imputation flag on the labels to determine whether the Architect needs to create new labels
- Design matrices are combined with the correct labels on reading, and rows are dropped if there is no label and the imputation key is `False`
- The ModelEvaluator takes only a single set of metrics, and Catwalk creates a separate ModelEvaluator object for training and testing.
- You can still skip writing train results to conserve db storage
- Production/predict-forward/retrain-and-predict uses design matrices but not labels
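The read-time combination step in this alternative could look something like the following Python sketch. The function `combine` and its dict-based inputs are hypothetical stand-ins for Triage's actual matrix and label stores, shown only to illustrate the join-and-drop behavior.

```python
def combine(design_rows, labels, impute, value=None):
    """Join a design matrix with separately stored labels at read time
    (hypothetical sketch).

    design_rows: dict keyed by (entity_id, as_of_date) -> feature dict,
                 covering all entities in the cohort.
    labels: dict keyed by (entity_id, as_of_date) -> label, from the
            separate label store.
    impute: the imputation key from the label metadata.
    value: fill value used when impute is True.
    """
    combined = {}
    for key, features in design_rows.items():
        label = labels.get(key)
        if label is None:
            if not impute:
                continue  # no label and no imputation: drop the row
            label = value  # impute the configured value
        combined[key] = (features, label)
    return combined
```

Since the design matrices carry no labels, the production/predict-forward path can read them directly and skip this step entirely.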
Originally posted by @ecsalomon in #631 (comment)