dssg/triage

Create features per fold

ecsalomon opened this issue · 0 comments

Triage would ideally create features separately for each temporal fold. Columns (whether quantitative aggregates or categorical choice aggregates) that would not have been available at training time, or that would not have met the conditions of the choice query (e.g., at least 1000 examples of the choice), should not be used in the training or test matrices for models built on that fold. This raises some questions we might encounter in implementing this behavior:

  • How will model grouping be handled when the same feature configuration results in different columns over time?
  • What metadata should we store about model features? Should we include both the observed and configured features in model metadata?
  • Will we need additional metadata for matrices? Should they also have the configured features in their metadata?
  • Test matrices are built at a different as-of-date than the training matrices they are paired with. Will we tie test matrices to specific training matrices (i.e., this test matrix has the features available at training time X), or try to make them generic and get the feature list from the trained model? If the latter, will we run into situations where the test matrix is missing a feature that was available at training time, and how will we handle that?
  • Alternatively, should we refactor the feature generation much more broadly and avoid some of these matrix questions altogether?
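
To make the questions above concrete, here is a minimal sketch of the behavior being discussed, not a proposal for Triage's actual implementation. It assumes events are plain `(date, choice)` tuples, and the helper names `choice_columns_as_of` and `align_to_training_columns` are hypothetical and do not exist in Triage. The first helper picks the categorical choice columns that meet a minimum count as of a fold's training date; the second projects a test row onto the columns the model was trained with, which is one possible answer to the "generic test matrix" question.

```python
from datetime import date


def choice_columns_as_of(events, as_of_date, min_count):
    """Return the sorted choices seen at least `min_count` times before `as_of_date`.

    `events` is an iterable of (event_date, choice) tuples (an assumed format).
    """
    counts = {}
    for event_date, choice in events:
        # Only events strictly before the training date are visible to this fold.
        if event_date < as_of_date:
            counts[choice] = counts.get(choice, 0) + 1
    return sorted(choice for choice, n in counts.items() if n >= min_count)


def align_to_training_columns(row, training_columns, fill_value=0):
    """Project a test row (dict) onto the training columns.

    Columns the model never saw are dropped; columns missing from the test
    row are filled with `fill_value` (an assumed imputation choice).
    """
    return {col: row.get(col, fill_value) for col in training_columns}


# Toy data: the threshold is 2 here instead of 1000 to keep the example small.
events = [
    (date(2015, 1, 10), "theft"),
    (date(2015, 3, 5), "theft"),
    (date(2015, 6, 1), "fraud"),
    (date(2016, 2, 1), "fraud"),
    (date(2016, 4, 1), "arson"),
]

# The same feature configuration yields different columns per fold.
fold_1_columns = choice_columns_as_of(events, date(2016, 1, 1), min_count=2)
fold_2_columns = choice_columns_as_of(events, date(2017, 1, 1), min_count=2)

# A test row built at a later as-of-date may carry columns fold 1 never saw.
test_row = {"theft": 3, "fraud": 1, "arson": 1}
aligned_fold_1 = align_to_training_columns(test_row, fold_1_columns)

# A test row may also lack a column that was available at training time.
sparse_row = {"fraud": 1}
aligned_fold_2 = align_to_training_columns(sparse_row, fold_2_columns)
```

In this toy data, fold 1 only sees enough `theft` events to keep that column, while fold 2 also keeps `fraud`. That difference is the model grouping question in miniature: one configuration, two observed column sets. Filling a missing column with a constant is only one option; whether that is acceptable depends on the imputation rules configured for the feature.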