kapelner/bartMachine

Work with individual trees from bartMachine object

Closed · 9 comments

Thank you for the excellent package.

I would like to work with the individual trees from a bartMachine object. Is this possible?

To clarify, let me explain my reason for wanting to do so. It is based on a result in Stefan Wager & Susan Athey (JASA, 2018; link: https://www.tandfonline.com/doi/full/10.1080/01621459.2017.1319839). They show that a conditional mean prediction from a random forest can be interpreted as a kernel-weighted average: to generate the prediction at a point X=x in the covariate space, each training observation is assigned a kernel weight, and an observation's weight reflects how often, across trees, it falls into the leaf that would contain X=x (with each tree's contribution normalized by the size of that leaf).

If I have the individual trees, then I can construct such kernel weights. I would just need the ability to input a covariate value X=x and know, for each tree, what observations fall into the leaf that would generate the prediction for X=x.
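For concreteness, my reading of the weights in Wager & Athey (notation mine) is

$$\alpha_i(x) = \frac{1}{B}\sum_{b=1}^{B} \frac{\mathbf{1}\{X_i \in L_b(x)\}}{\lvert L_b(x) \rvert},$$

where $L_b(x)$ is the leaf of tree $b$ that would contain $x$, $\lvert L_b(x) \rvert$ is the number of training observations in that leaf, and $B$ is the number of trees, so the forest prediction is $\hat{\mu}(x) = \sum_i \alpha_i(x) Y_i$.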

Doing so is useful for various reasons (not to get too deep into the weeds, but I am interested in using this for a kernel-based implementation of "trimming bounds" from a paper by David S. Lee (2008, REStat)).

@kapelner thanks so much for the reply. Yes, I am aware that it would give you only a component of the overall prediction. The reason having the trees is useful is to be able to construct the implied kernel weights.

Let me take a look at YARF.

In the meantime, if a more concrete reference is helpful, the get_sample_weights() function for regression_forest() objects in the grf package does essentially what I'd like to do: input a covariate value x and get back a vector of weights over the training set, where each unit's weight is proportional to the number of trees in which its covariate value falls into a leaf that would also contain x.
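As a minimal sketch of that behavior (this is how I've been using grf; in more recent grf releases I believe the function is named get_forest_weights()):

library(grf)

# toy training data
n = 100
X = matrix(rnorm(n), ncol = 1)
Y = rnorm(n)
rf = regression_forest(X, Y)

# weight matrix for a few new points: one row per new point, one column per training unit
X_star = matrix(rnorm(7), ncol = 1)
W = get_sample_weights(rf, X_star)

# the kernel-weighted average of the training outcomes reproduces the forest
# prediction (up to numerical noise)
as.numeric(W %*% Y)
predict(rf, X_star)$predictions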

I've released an alpha of your feature to master. You can clone and do R CMD INSTALL. Right now it's slow because, for every single new observation, I return a num_trees x num_gibbs_samples x n tensor of booleans, where TRUE means the node this new observation fell into also contains training observation i. Since those other two dimensions (trees and Gibbs samples) may be important, I give the user the option to query them and do whatever they wish with the raw information. To get the weights I believe you want, I average over those two dimensions. It's slow because all of this takes place in R.

If this is what you want, I can just write a Java function to get those averages quickly. Here's some code that demos what's going on:

options(java.parameters = "-Xmx5g")  # give the JVM a 5GB heap before bartMachine loads
pacman::p_load(bartMachine, tidyverse, magrittr)

n = 100
x = rnorm(n)
y = rnorm(n)

# keep the node membership indices in memory so they can be queried afterward
bart_mod = bartMachine(data.frame(x = x), y, flush_indices_to_save_RAM = FALSE)
bart_mod

# for some new observations
n_star = 7
x_star = data.frame(x = rnorm(n_star))
indices_on_off = node_prediction_training_data_indices(bart_mod, as.matrix(x_star))  # raw boolean tensors
props = get_sample_weights(bart_mod, as.matrix(x_star))  # averaged over trees and Gibbs samples

# for all the training data
indices_on_off = node_prediction_training_data_indices(bart_mod)
props = get_sample_weights(bart_mod)
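To spell out the relationship between indices_on_off and props: assuming the raw return for a single new observation is a num_trees x num_gibbs x n logical array (I'm sketching the structure here rather than the exact return type), the weights are just the mean over the first two dimensions:

# indices_one_obs: num_trees x num_gibbs x n logical array for one new observation (assumed structure)
# averaging over trees and Gibbs samples gives that observation's weight on each training point
weights_one_obs = apply(indices_one_obs, 3, mean)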

That is very much appreciated! Let me work with it over the next few days and I will report back.

Thank you again for your work on this. Looking at the results, it is possible that what I understood from the Wager and Athey paper is incorrect, or that something was lost in translation. The reason is that, for the following simple demo:

n = 100
x = rnorm(n)
y = rnorm(n)

bart_mod = bartMachine(data.frame(x = x), y, flush_indices_to_save_RAM = FALSE)
bart_mod

#for some new observations
n_star = 7
x_star = data.frame(x = rnorm(n_star))
indices_on_off = node_prediction_training_data_indices(bart_mod, as.matrix(x_star))
props = bartMachine::get_sample_weights(bart_mod, as.matrix(x_star))

model_preds <- predict(bart_mod, new_data = x_star)
matrix_preds <- props %*% as.matrix(y)  # kernel-weighted average of the training outcomes

We should find that model_preds and matrix_preds are equal. (I confirmed that this is what happens with the grf package output.) But this is not the case when the code above is run.
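To make the comparison concrete, the check I'm running amounts to:

# these should match if props were exact kernel weights, but they do not here
cbind(model_preds, as.numeric(matrix_preds))
max(abs(model_preds - as.numeric(matrix_preds)))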

Hi Cyrus,

I've made an update to the function and you can see an example via ?get_projection_weights. At this point I'm stumped. There are a few things going on:

  • The sum-of-trees model is fit via backfitting, so in some sense the node prediction within each tree can be counted many times (both positively and negatively) in the final model prediction.
  • There are transformations of the original response variable, so it is scaled to [-1/2, +1/2] internally.
  • The node prediction values are Bayesian draws from the node's posterior distribution.

So given these three considerations, I just cannot figure out a way to create weight vectors analogous to those in a random forest, where none of these issues arise:

  • The trees are fit independently.
  • The response is never transformed.
  • The node prediction values are simply the sample averages of the response values.

The weight vectors currently do not reflect the prediction values. However, over many, many simulations, I've found that I'm off by what appears to be a multiplicative constant, so I added an argument that finds the constant and adjusts for it as a kludge.
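By analogy with the get_sample_weights() demo above, I'd expect usage roughly like the following (a sketch; I'm assuming the function takes the bartMachine object and the new data in the same way, and the exact argument names, including the constant-adjustment flag, are documented in ?get_projection_weights rather than spelled out here):

# sketch, assuming get_projection_weights() returns an n_star x n matrix of weights
w_star = get_projection_weights(bart_mod, as.matrix(x_star))
kernel_preds = as.numeric(w_star %*% as.matrix(y))
model_preds = predict(bart_mod, new_data = x_star)

# per the discussion above, without the constant adjustment these only
# agree up to a multiplicative constant
cbind(model_preds, kernel_preds)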

It would be great if someone else could pick up the slack on this and make a PR, as this is an important feature and I think it's possible to implement correctly.

Interesting -- this should work for my purposes for now. It's possible that I could come back to it with a student who may be able to help out, since we are doing work on the kernel representations for this broad class of estimators.