grf-labs/policytree

Dummy variables ignored for split.step > 1


I realized that dummy policy variables are ignored when policy_tree is run with split.step > 1. This of course makes sense on closer inspection of the algorithm, but it is an unfortunate feature for practitioners who might not be aware of it. My current brute-force workaround is to round the continuous variables and proceed with split.step = 1.
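For concreteness, here is a minimal sketch of how this can bite (the data and reward matrix are hypothetical toys; the behavior described applies to versions before the fix mentioned at the end of this thread):

```r
library(policytree)

set.seed(42)
n <- 500
X <- cbind(
  x1 = rnorm(n),           # densely packed continuous variable
  x2 = rbinom(n, 1, 0.5)   # dummy variable with only two distinct values
)
# Toy doubly robust reward matrix in which the dummy drives the optimal policy
Gamma <- cbind(control = -X[, "x2"], treat = X[, "x2"])

# With split.step > 1, the dummy's single candidate split can be skipped
# entirely, so the fitted tree never splits on x2 in affected versions.
tree <- policy_tree(X, Gamma, depth = 1, split.step = 10)
print(tree)
```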

I agree that the current definition of split.step doesn't work when there's a mix of continuous and discrete data. For reference, the current definition is:

split.step: An optional approximation parameter (integer above zero),
            the number of possible splits to consider when performing
            tree search. split.step = 1 (default) considers every
            possible split, split.step = 10 considers splitting at every
            10'th distinct value and will yield a substantial speedup for
            densely packed continuous data.

We should probably change it so that split.step simply skips observations, as opposed to skipping distinct values.
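A toy illustration of the difference between the two rules (hypothetical numbers, just to show why the current rule drops dummies):

```r
x <- c(rep(0, 50), rep(1, 50))  # a dummy: 100 observations, 2 distinct values

# Current rule: every 10th *distinct* value. Only the value 0 survives,
# so the dummy contributes no usable split.
sort(unique(x))[seq(1, length(unique(x)), by = 10)]
#> [1] 0

# Proposed rule: every 10th *observation*. Both values remain candidates.
sort(x)[seq(1, length(x), by = 10)]
#> [1] 0 0 0 0 0 1 1 1 1 1
```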

Manually rounding the Xj's (j = 1, ..., p) provides more control and can circumvent this. To accommodate split.step for mixed continuous and discrete Xj's, we could simply make split.step a length-p array with a separate split step for each Xj.

I agree that manually rounding seems very natural and interpretable. Maybe we should emphasize that more in the documentation?
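For instance, a sketch along the lines of what was described above (the column name and number of digits are hypothetical; X and Gamma are as in an ordinary policy_tree call):

```r
# Coarsen only the continuous column(s); dummies keep their exact values,
# so every variable remains eligible for splits at split.step = 1.
X_rounded <- X
X_rounded[, "x1"] <- round(X[, "x1"], digits = 1)

tree <- policy_tree(X_rounded, Gamma, depth = 2, split.step = 1)
```

Rounding caps the number of distinct split points per continuous variable, which should recover much of the speedup that split.step was meant to provide.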

I'm not sold on the idea of having different split.step arguments for different variables, though. It's just too complicated and potentially confusing.

In hindsight, having split.step depend on the number of unique values (as opposed to the number of observations) seems almost like a bug to me. We pitch split.step as a purely computational improvement that should essentially not affect performance when n is large, but this ends up not being true when features take on only a small number of discrete values.

Intuitively, it does feel like simply skipping observations is the way to go. We might want to scale the number of points we skip as a function of the data size to keep things fast.
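One hypothetical scaling rule along those lines (not part of the package, just to make the idea concrete):

```r
# Skip more observations as n grows, but never go below a step of 1.
n <- nrow(X)
split.step <- max(1L, as.integer(n / 1000))
```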

Skipping unique points is also legitimate, I think, in the case where the user specifies the skip step as a function of the number of unique points. If they instead keep the skip step constant (which I think is the general case), then we should not be skipping unique points if we want to treat this as a pure computational improvement. So in general I agree that we should probably just skip observations.

Thanks for reporting @MCKnaus, this is now fixed in #73

Just keeping this open for visibility since it will be a while until the next CRAN release

(Fix included in the 1.0.4 CRAN release)