Impute data at regular intervals
domoritz opened this issue · 7 comments
I would like to impute missing values so that I have values at a regular interval. For example, in the spec below, the data table is missing the values for `5` and `6`. To use impute, I need to have the domain (a field to group by) in the same data source. With `sequence`, I could generate such a field, but I don't see how I could use it, as the sequence is in a different data source.
{
"$schema": "https://vega.github.io/schema/vega/v3.0.json",
"data": [
{
"name": "table",
"values": [
{"a": 1,"b": 28},
{"a": 2,"b": 55},
{"a": 3,"b": 43},
{"a": 4,"b": 91},
{"a": 7,"b": 81},
{"a": 8,"b": 53}
],
"transform": [{"type": "impute","field": "b","method": "mean"}]
},
{
"name": "sequence",
"transform": [{"type": "sequence","start": 0,"stop": 10}]
}
]
}
I could imagine adding a `domain` parameter to the transform to explicitly indicate all values that should be represented. That would also allow tuple imputation without requiring groupby fields (and thanks for noting the bug / inconsistency there!).
However, I'm not sure how the `domain` input should be structured. For a single `orderby` field I could imagine a flat array. However, the transform currently allows multiple `orderby` fields. Internally, the transform uses an array of arrays (each inner array containing a unique set of orderby field values) to track the domain. I don't think this is the most intuitive or sensible form for spec-level parameter input, not least because the expression language doesn't really provide facilities to generate or work with nested arrays. I suppose referencing another data set rather than a raw array might be another approach, but that is also more complex (both for end users and in terms of internal implementation).
The use case you describe above could be achieved (in the single-field case with a flat domain array) by using the `sequence` expression function directly (not the tuple-generating `sequence` operator).
I wonder if we really need multiple `orderby` fields? Does it make sense to limit this to a single field instead?
A `domain` parameter in conjunction with the `sequence` expression would be exactly what I need for the case above. If it simplifies things, I think it is fine to only have a single `orderby` field. If an application really needs multiple `orderby` fields, one could derive a new field.
Addressed in 89debe8, released with vega-dataflow v2.0.0.
The `orderby` parameter is now named `key`, and accepts a single field rather than a field array. There is also a new `keyvals` parameter (akin to the `domain` parameter discussed above) that can be used to specify key values that must occur or otherwise be imputed.
I was wrong about the assumption that a single `orderby` or `key` field is almost always sufficient. Vega-Lite uses impute for stacking and we use the multiple keys. I wasn't aware of this as @kanitw implemented that part.
To fix this, I am adding a calculate that generates a single key from the key fields and uses that as the impute key.
OK, let me know if the current transform fails for some reason. It sounds like you'll basically do the key generation work externally (in the spec) rather than letting `impute` do it internally (in the transform code), so I don't think there are any major performance concerns here.
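One way the external key generation could look in a Vega spec is sketched below. A `formula` transform concatenates two fields into a single derived `key` field, which is then handed to impute. The field names (`x`, `facet`, `series`, `value`), the `|` separator, and the sample rows are all hypothetical, and this is not necessarily what Vega-Lite actually emits.

```json
{
  "$schema": "https://vega.github.io/schema/vega/v3.0.json",
  "description": "Illustrative sketch only: field names, separator, and data are hypothetical.",
  "data": [
    {
      "name": "stacked",
      "values": [
        {"x": 1, "facet": "A", "series": "s1", "value": 10},
        {"x": 2, "facet": "A", "series": "s1", "value": 12},
        {"x": 1, "facet": "A", "series": "s2", "value": 7}
      ],
      "transform": [
        {"type": "formula", "expr": "datum.x + '|' + datum.facet", "as": "key"},
        {
          "type": "impute",
          "field": "value",
          "key": "key",
          "groupby": ["series"],
          "method": "value",
          "value": 0
        }
      ]
    }
  ]
}
```

Here series `s2` has no row for the derived key `2|A`, so a row with `value` 0 would be imputed for it. As far as I understand the transform, imputed rows carry only the key, groupby, and imputed fields, so the component fields (`x`, `facet`) would have to be recovered from the derived key if they are needed downstream.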
Now that vega-dataflow 2.0.0 has been released, any breaking changes to the current design will force a new semver major version. I'd rather not have to do that :)