loglabs/mltrace

Materializing historical component input/outputs: Technical Requirements

shreyashankar opened this issue · 0 comments

Users may want to create components like:

class Preprocessing(Component):
  def __init__(self, ...):
    super().__init__(...)

  def beforeRun(self, **kwargs):
    # Open question: how do we give users access to history here?
    # Option 1: an attribute, e.g. self.history.inputs["df"][-1]
    # Option 2: a historical_inputs parameter to beforeRun,
    #           e.g. historical_inputs["df"][-1]
    pass

  def afterRun(self, **local_vars):
    ...


@Preprocessing().run
def preprocess(df):
  return df * 2

Preliminaries

Task: we want to log the rolling mean of feat1 and the KL divergence between consecutive runs of the component.
Problem: during a given component run, users have no access to that component's historical inputs and outputs.
Solution idea: materialize the component's historical inputs and outputs, and give users a flexible DSL (domain-specific language) for working with them.
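
To make the task concrete, here is a rough sketch of the two metrics, assuming the feat1 values from previous runs are already available as plain lists. The interpretation of "rolling mean" as a mean over per-run means, the bin edges, the add-one smoothing, and the names `feat1_by_run`, `rolling_mean`, `binned_distribution`, and `kl_divergence` are all illustrative assumptions, not part of mltrace.

```python
import math

def rolling_mean(values, window):
    """Mean of the most recent `window` values (or all of them if fewer exist)."""
    recent = values[-window:]
    return sum(recent) / len(recent)

def binned_distribution(samples, edges):
    """Normalized histogram over `edges`, with add-one smoothing so that
    no bin has zero probability (KL divergence is undefined otherwise).
    Samples that fall outside the edges are ignored."""
    counts = [1.0] * (len(edges) - 1)
    for x in samples:
        for i in range(len(counts)):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical feat1 values seen by the last three runs, oldest first.
feat1_by_run = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]]

# Rolling mean of the per-run feat1 means over the two most recent runs.
run_means = [sum(run) / len(run) for run in feat1_by_run]
feat1_rolling_mean = rolling_mean(run_means, window=2)

# KL divergence between the current run and the one before it.
edges = [0.0, 3.0, 6.0, 9.0]  # assumed bin edges
previous = binned_distribution(feat1_by_run[-2], edges)
current = binned_distribution(feat1_by_run[-1], edges)
drift = kl_divergence(current, previous)
```

Everything here except `feat1_by_run` is easy for a user to write today; the missing piece is getting that history inside beforeRun / afterRun.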

To Dos:

  • Design the API through which the user will interact with the history.
  • Handle the storage layer: make sure each component run's inputs and outputs are fully stored as (key name, value) pairs. This will involve extending IOPointer to store the exact values.
  • Handle the execution layer: in base_component.py, make sure the history is retrieved from the DB. This involves setting self.history in the constructor.
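
As a starting point for the storage and execution to-dos, here is a minimal sketch of storing exact values per (component, key name) and reading them back in run order. It uses an in-memory SQLite table and pickle purely for illustration; the table `io_values`, its columns, and the helpers `open_store`, `save_io`, and `load_history` are hypothetical and do not reflect mltrace's actual schema or the IOPointer class.

```python
import pickle
import sqlite3
import time

def open_store(path=":memory:"):
    """Open (or create) a hypothetical store for materialized inputs/outputs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS io_values ("
        "component TEXT, direction TEXT, key TEXT, value BLOB, created_at REAL)"
    )
    return conn

def save_io(conn, component, direction, key, value, created_at=None):
    """Store the exact value (pickled) under (component, direction, key)."""
    timestamp = time.time() if created_at is None else created_at
    conn.execute(
        "INSERT INTO io_values VALUES (?, ?, ?, ?, ?)",
        (component, direction, key, pickle.dumps(value), timestamp),
    )
    conn.commit()

def load_history(conn, component, direction, key):
    """Return all stored values for one key, oldest first."""
    rows = conn.execute(
        "SELECT value FROM io_values "
        "WHERE component = ? AND direction = ? AND key = ? "
        "ORDER BY created_at, rowid",
        (component, direction, key),
    ).fetchall()
    return [pickle.loads(blob) for (blob,) in rows]

# Two runs of a "preprocessing" component log their input under key "df".
conn = open_store()
save_io(conn, "preprocessing", "input", "df", [1, 2, 3], created_at=1.0)
save_io(conn, "preprocessing", "input", "df", [4, 5, 6], created_at=2.0)

# What a component constructor could load into self.history.
history = load_history(conn, "preprocessing", "input", "df")
```

Pickling whole values is the simplest way to "store the values exactly", but it will get expensive for large dataframes, so the design doc should say whether values are stored inline or by reference.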

We need to write design docs for this. For the first to-do (API design), list the primitives you are going to offer (i.e., the variables the user can interact with), then give examples of how the user might interact with them. At the end, demonstrate that the primitives you offer cover all of the user's technical requirements and potential use cases.

Technical requirements / use cases for the history API

  • Users should be able to access the inputs or outputs of any previous component run.
  • Inputs and outputs are stored as (key, value) pairs and looked up by key. Example: ("df", ...)
  • Users should be able to access historical inputs / outputs both by count and by time range. For example: "I want the last 10 input dfs" or "I want the input dfs generated in the last 10 days."
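
One possible shape for primitives that cover these three requirements is sketched below. This is a proposal to react to, not an existing mltrace API: `Record`, `ValueHistory`, `History`, and the methods `last` and `since` are all hypothetical names, and the fixed `now` date is arbitrary.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Any, Dict

@dataclass
class Record:
    value: Any
    timestamp: datetime

class ValueHistory:
    """All values logged under one key, ordered oldest first."""

    def __init__(self, records):
        self._records = sorted(records, key=lambda r: r.timestamp)

    def __len__(self):
        return len(self._records)

    def __getitem__(self, index):
        # Supports history.inputs["df"][-1] as well as slices.
        if isinstance(index, slice):
            return [r.value for r in self._records[index]]
        return self._records[index].value

    def last(self, n):
        """The n most recent values, oldest first."""
        return [r.value for r in self._records[-n:]] if n > 0 else []

    def since(self, window, now=None):
        """Values logged within `window` (a timedelta) before `now`."""
        cutoff = (now or datetime.now()) - window
        return [r.value for r in self._records if r.timestamp >= cutoff]

@dataclass
class History:
    inputs: Dict[str, ValueHistory] = field(default_factory=dict)
    outputs: Dict[str, ValueHistory] = field(default_factory=dict)

# Fake history: the value i was logged i days before `now`.
now = datetime(2021, 6, 30)
records = [Record(value=i, timestamp=now - timedelta(days=i)) for i in range(30)]
history = History(inputs={"df": ValueHistory(records)})

latest = history.inputs["df"][-1]                                 # most recent input df
last_ten = history.inputs["df"].last(10)                          # by count
recent = history.inputs["df"].since(timedelta(days=10), now=now)  # by time range
```

Under this sketch, the first requirement maps to indexing, the second to the `inputs` / `outputs` dictionaries keyed by name, and the third to `last` and `since`. Whether `history` is an attribute on the component or a parameter to beforeRun is still the open question from the snippet at the top.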