kapelner/bartMachine

Work with individual trees from bartMachine object

Closed · 9 comments

Thank you for the excellent package.

I would like to work with the individual trees from a bartMachine object. Is this possible?

To clarify, let me explain my reason for wanting to do so. It is based on a result in Stefan Wager & Susan Athey (JASA, 2018; link: https://www.tandfonline.com/doi/full/10.1080/01621459.2017.1319839). They show that a conditional mean prediction from a random forest can be interpreted as a kernel-weighted average: to generate the prediction at a point X=x in the covariate space, each training observation is assigned a kernel weight, and an observation's weight reflects how often, across trees, it falls into the leaf that would contain X=x (with each tree's contribution normalized by the size of that leaf).

If I have the individual trees, then I can construct such kernel weights. I would just need the ability to input a covariate value X=x and know, for each tree, what observations fall into the leaf that would generate the prediction for X=x.
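For concreteness, my reading of the weights in Wager & Athey (notation mine) is

$$\alpha_i(x) = \frac{1}{B}\sum_{b=1}^{B} \frac{\mathbf{1}\{X_i \in L_b(x)\}}{\lvert L_b(x) \rvert},$$

where $L_b(x)$ is the leaf of tree $b$ that would contain $x$, $\lvert L_b(x) \rvert$ is the number of training observations in that leaf, and $B$ is the number of trees, so the forest prediction is $\hat{\mu}(x) = \sum_i \alpha_i(x) Y_i$.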

Doing so is useful for various reasons (not to get too deep into the weeds, but I am interested in using this for a kernel-based implementation of "trimming bounds" from a paper by David S. Lee (2008, REStat)).

@kapelner thanks so much for the reply. Yes, I am aware that it would give you only a component of the overall prediction. The reason having the trees is useful is to be able to construct the implied kernel weights.

Let me take a look at YARF.

In the meantime, if a more concrete reference is helpful, the get_sample_weights() function for regression_forest() objects in the grf package does essentially what I'd like to do: input a covariate value x and get back a vector of weights over the training set, where each unit's weight is proportional to the number of trees in which its covariate value falls into a leaf that would also contain x.
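As a minimal sketch of that behavior (this is how I've been using grf; in more recent grf releases I believe the function is named get_forest_weights()):

library(grf)

# toy training data
n = 100
X = matrix(rnorm(n), ncol = 1)
Y = rnorm(n)
rf = regression_forest(X, Y)

# weight matrix for a few new points: one row per new point, one column per training unit
X_star = matrix(rnorm(7), ncol = 1)
W = get_sample_weights(rf, X_star)

# the kernel-weighted average of the training outcomes reproduces the forest
# prediction (up to numerical noise)
as.numeric(W %*% Y)
predict(rf, X_star)$predictions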

I've released an alpha of your feature to master. You can clone and do R CMD INSTALL. Right now it's slow because, for every single new observation, I return a num_trees x num_gibbs_samples x n tensor of booleans, where TRUE means the node this new observation fell into also contains training observation i. Since those other two dimensions (trees and Gibbs samples) may be important, I give the user the option to query them and do whatever they wish with the raw information. To get the weights I believe you want, I average over those two dimensions. It's slow because all of this takes place in R.

If this is what you want, I can just write a Java function to get those averages quickly. Here's some code that demos what's going on:

options(java.parameters = "-Xmx5g")  # give the JVM a 5GB heap before bartMachine loads
pacman::p_load(bartMachine, tidyverse, magrittr)

n = 100
x = rnorm(n)
y = rnorm(n)

# keep the node membership indices in memory so they can be queried afterward
bart_mod = bartMachine(data.frame(x = x), y, flush_indices_to_save_RAM = FALSE)
bart_mod

# for some new observations
n_star = 7
x_star = data.frame(x = rnorm(n_star))
indices_on_off = node_prediction_training_data_indices(bart_mod, as.matrix(x_star))  # raw boolean tensors
props = get_sample_weights(bart_mod, as.matrix(x_star))  # averaged over trees and Gibbs samples

# for all the training data
indices_on_off = node_prediction_training_data_indices(bart_mod)
props = get_sample_weights(bart_mod)
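To spell out the relationship between indices_on_off and props: assuming the raw return for a single new observation is a num_trees x num_gibbs x n logical array (I'm sketching the structure here rather than the exact return type), the weights are just the mean over the first two dimensions:

# indices_one_obs: num_trees x num_gibbs x n logical array for one new observation (assumed structure)
# averaging over trees and Gibbs samples gives that observation's weight on each training point
weights_one_obs = apply(indices_one_obs, 3, mean)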

That is very much appreciated! Let me work with it over the next few days and I will report back.

Thank you again for your work on this. Looking at the results, it is possible that what I understood from the Wager and Athey paper is incorrect, or that something was lost in translation. The reason is that, for the following simple demo:

n = 100
x = rnorm(n)
y = rnorm(n)

bart_mod = bartMachine(data.frame(x = x), y, flush_indices_to_save_RAM = FALSE)
bart_mod

#for some new observations
n_star = 7
x_star = data.frame(x = rnorm(n_star))
indices_on_off = node_prediction_training_data_indices(bart_mod, as.matrix(x_star))
props = bartMachine::get_sample_weights(bart_mod, as.matrix(x_star))

model_preds <- predict(bart_mod, new_data = x_star)
matrix_preds <- props %*% as.matrix(y)  # kernel-weighted average of the training outcomes

We should find that model_preds and matrix_preds are equal. (I confirmed that this is what happens with the grf package output.) But this is not the case when the code above is run.
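To make the comparison concrete, the check I'm running amounts to:

# these should match if props were exact kernel weights, but they do not here
cbind(model_preds, as.numeric(matrix_preds))
max(abs(model_preds - as.numeric(matrix_preds)))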

Hi Cyrus,

I've made an update to the function and you can see an example via ?get_projection_weights. At this point I'm stumped. There are a few things going on:

  • The sum-of-trees model is fit via backfitting, so in some sense the node prediction within each tree can be counted many times (both positively and negatively) in the final model prediction.
  • There are transformations of the original response variable, so it is scaled to [-1/2, +1/2] internally.
  • The node prediction values are Bayesian draws from the node's posterior distribution.

So given these three considerations, I just cannot figure out a way to create weight vectors analogous to those in a random forest, where none of these issues arise:

  • The trees are fit independently.
  • The response is never transformed.
  • The node prediction values are simply the sample averages of the response values.

The weight vectors currently do not reflect the prediction values. However, over many, many simulations, I've found that I'm off by what appears to be a multiplicative constant, so I added an argument that finds the constant and adjusts for it as a kludge.
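By analogy with the get_sample_weights() demo above, I'd expect usage roughly like the following (a sketch; I'm assuming the function takes the bartMachine object and the new data in the same way, and the exact argument names, including the constant-adjustment flag, are documented in ?get_projection_weights rather than spelled out here):

# sketch, assuming get_projection_weights() returns an n_star x n matrix of weights
w_star = get_projection_weights(bart_mod, as.matrix(x_star))
kernel_preds = as.numeric(w_star %*% as.matrix(y))
model_preds = predict(bart_mod, new_data = x_star)

# per the discussion above, without the constant adjustment these only
# agree up to a multiplicative constant
cbind(model_preds, kernel_preds)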

It would be great if someone else could pick up the slack on this and make a PR, as this is an important feature and I think it's possible to implement correctly.

Interesting -- this should work for my purposes for now. It's possible that I could come back to it with a student who may be able to help out, since we are doing work on the kernel representations for this broad class of estimators.