vega/vega-dataflow

Impute data at regular intervals

domoritz opened this issue · 7 comments

I would like to impute missing values so that I have values at a regular interval. For example, in the spec below, the data table is missing the values for 5 and 6. To use impute, I need to have the domain (a field to group by) in the same data source. With sequence, I could generate such a field, but I don't see how I could use it, as the sequence is in a different data source.

{
  "$schema": "https://vega.github.io/schema/vega/v3.0.json",
  "data": [
    {
      "name": "table",
      "values": [
        {"a": 1,"b": 28},
        {"a": 2,"b": 55},
        {"a": 3,"b": 43},
        {"a": 4,"b": 91},
        {"a": 7,"b": 81},
        {"a": 8,"b": 53}
      ],
      "transform": [{"type": "impute","field": "b","method": "mean"}]
    },
    {
      "name": "sequence",
      "transform": [{"type": "sequence","start": 0,"stop": 10}]
    }
  ]
}

Vega seems to expect the groupby field although it is optional in the docs.

jheer commented

I could imagine adding a domain parameter to the transform to explicitly indicate all values that should be represented. That would also allow tuple imputation without requiring groupby fields (and thanks for noting the bug / inconsistency there!).

However, I'm not sure how the domain input should be structured. For a single orderby field I could imagine a flat array. However, the transform currently allows multiple orderby fields. Internally, the transform uses an array of arrays (each inner array containing a unique set of orderby field values) to track the domain. I don't think this is the most intuitive or sensible structure for spec-level parameter input, not least because the expression language doesn't really provide facilities to generate or work with nested arrays. I suppose referencing another data set rather than a raw array might be another approach, but that is also more complex (both for end users and in terms of internal implementation).

The use case you describe above could be achieved (in the single field case with flat domain array) by using the sequence expression function directly (not the tuple-generating sequence operator).

I wonder if we really need multiple orderby fields? Does it make sense to limit this to a single field instead?

jheer commented

@domoritz Any thoughts on my comment above?

domoritz commented

A domain parameter in conjunction with the sequence expression would be exactly what I need for the case above. If it simplifies things, I think it is fine to only have a single orderby field. If an application really needs multiple orderby fields, one could derive a new field.

jheer commented

Addressed in 89debe8, released with vega-dataflow v2.0.0.

The orderby parameter is now named key, and accepts a single field rather than a field array. There is also a new keyvals parameter (akin to the domain parameter discussed above) that can be used to specify key values that must occur or otherwise be imputed.
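
For illustration, here is a minimal sketch of how the spec from the original report might look with these parameters. It assumes that the spec-level impute transform exposes key and keyvals under those names and that keyvals accepts a signal expression. The range sequence(1, 9) is chosen here only so that the keys 1 through 8 are covered.

```json
{
  "$schema": "https://vega.github.io/schema/vega/v3.0.json",
  "description": "Sketch only: assumes impute accepts key and keyvals as described in this thread.",
  "data": [
    {
      "name": "table",
      "values": [
        {"a": 1, "b": 28},
        {"a": 2, "b": 55},
        {"a": 3, "b": 43},
        {"a": 4, "b": 91},
        {"a": 7, "b": 81},
        {"a": 8, "b": 53}
      ],
      "transform": [
        {
          "type": "impute",
          "field": "b",
          "key": "a",
          "keyvals": {"signal": "sequence(1, 9)"},
          "method": "mean"
        }
      ]
    }
  ]
}
```

Under these assumptions, the transform should add tuples for a = 5 and a = 6, with b set to the mean of the observed b values. No separate sequence data source and no groupby field should be needed.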

domoritz commented

I was wrong about the assumption that a single orderby or key field is almost always sufficient. Vega-Lite uses impute for stacking, and we use multiple keys there. I wasn't aware of this as @kanitw implemented that part.

To fix this, I am adding a calculate that generates a single key from the key fields and passes that to impute.
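
As a rough sketch of that workaround at the Vega level (where the equivalent of calculate is the formula transform), a composite key could be derived before imputing. All field names here (year, quarter, series, value, key) and the "-" separator are hypothetical and only serve the example; this is not the code Vega-Lite actually generates.

```json
{
  "$schema": "https://vega.github.io/schema/vega/v3.0.json",
  "description": "Sketch only: field names and the key format are hypothetical.",
  "data": [
    {
      "name": "table",
      "values": [
        {"year": 2016, "quarter": 1, "series": "A", "value": 3},
        {"year": 2016, "quarter": 2, "series": "A", "value": 5},
        {"year": 2016, "quarter": 1, "series": "B", "value": 4}
      ],
      "transform": [
        {
          "type": "formula",
          "as": "key",
          "expr": "datum.year + '-' + datum.quarter"
        },
        {
          "type": "impute",
          "field": "value",
          "key": "key",
          "groupby": ["series"],
          "method": "value",
          "value": 0
        }
      ]
    }
  ]
}
```

Here series B has no tuple for the key "2016-2", so impute should add one with value 0. The imputed tuple would probably carry only the derived key, the groupby field, and the imputed field, so any downstream encoding would have to work from the derived key rather than from year and quarter.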

jheer commented

OK, let me know if the current transform fails for some reason. It sounds like you'll basically do the key generation work externally (in the spec) rather than letting impute do it internally (in the transform code), so I don't think there are any major performance concerns here.

Now that vega-dataflow 2.0.0 has been released, any breaking changes to the current design will force a new semver major version. I'd rather not have to do that :)