ronikobrosly/causal-curve

Implement a .predict method for GPS and TMLE modules

NiklasTR opened this issue · 10 comments

I recently came across your package @ronikobrosly and really think it solves an unmet need for continuous variable to continuous variable causal inference problems.

In my problem, I am trying to find the causal relationship between the dose of medication and the continuous treatment outcome these patients have. I have used both the TMLE and GLM model to fit my training data with some interesting results so far.

To further explore the model fit and its predictive ability, I would like to use the trained model and run it on a validation dataset from a different institution that has not been used during training. Have you considered implementing such a function in the future or am I simply overlooking a method that is already available?

Thank you for putting this package together!

Hi @NiklasTR ! Thanks for the kind words and for sharing what you are using this for. It's nice to know others find this helpful!

You raise a really good point. The .predict() method isn't available yet but it would be a really easy thing to implement. Let me work on this, I should be able to implement this before the weekend is over. I'll keep you updated on this issue thread.

Ok here's the shameless plug: if you think this is useful please consider starring the repo, citing it, or sharing with colleagues. I've met lots of people who were interested in running this sort of analysis but weren't aware there are methods for this.

Hi @ronikobrosly thank you this is great!

I have been sharing your project with a couple of people at UCSF and MIT who are into causal ML for healthcare. There are many problems in intensive care medicine where granular numeric input and outcome data are available, but easy ways to leverage such relationships are missing (given their roots in epidemiology, most frameworks are still focused on binary treatment effect estimation). So far we have often binned the data, which, to me as a novice, seems like an inefficient way of learning models from such granular information.

Your last sentence is a great point. I must confess that the TMLE method in this package is basically a binary treatment model that is applied across binned values of the continuous treatment. I spoke with Mark van der Laan (the researcher behind TMLE) and this was basically his only suggestion. As such, the TMLE method is a little finicky: it depends heavily on which bin boundaries you pick. The GPS method doesn't have this problem, though. It runs natively with continuous treatments and is less finicky. Just wanted to let you know.

Yes, I saw that point in your paper - I might run a sensitivity analysis to see how bin size influences my model. Will keep you posted!
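
As a rough illustration of what such a sensitivity check could look like, here is a minimal sketch on synthetic data. It is not TMLE and does not use causal-curve; it only computes unadjusted mean outcomes within equal-width dose bins, and the data, the dose range, and the helper name binned_estimate are all made up for the example. The point is only that the estimate for one fixed dose can shift as the bin boundaries change.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not real patient data)
rng = np.random.default_rng(0)
n = 2000
T = rng.uniform(0.0, 6.0, size=n)                # continuous dose
y = np.sin(T) + rng.normal(scale=0.3, size=n)    # nonlinear outcome plus noise


def binned_estimate(dose, n_bins):
    """Mean outcome of the equal-width bin on [0, 6] that contains `dose`.

    Hypothetical helper: a plain bin average, with no confounder adjustment.
    """
    inner_edges = np.linspace(0.0, 6.0, n_bins + 1)[1:-1]
    bin_of_obs = np.digitize(T, inner_edges)      # bin index of every observation
    bin_of_dose = np.digitize(dose, inner_edges)  # bin index of the dose of interest
    return float(y[bin_of_obs == bin_of_dose].mean())


# Same dose, different bin widths
estimates = {k: binned_estimate(1.2, k) for k in (3, 6, 12)}
for k, est in estimates.items():
    print(f"{k:>2} bins -> estimated outcome at dose 1.2: {est:.3f}")
```

Repeating this over a grid of doses and bin counts would show where the binned curve is stable and where it is driven by the choice of boundaries.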

Hi @NiklasTR , when you have a chance would you mind installing the newest release (version 0.5.0) and trying out the new predict, predict_interval, and predict_log_odds functions? They are only available for the GPS tool. I need to think through the TMLE approach a bit more, sorry. One tricky thing with the GPS method is that it's not possible to specify the covariates when making a prediction, only the treatment. The generalized propensity score function is learned solely from the training data, and it is used in a way that leaves no place to input a new covariate value. So you'll only be able to specify the treatment value you want to predict for.

Once you think it's working, I'll close out the issue

Hi @NiklasTR , thought I would ping you again to see if you had any thoughts here. If I don't hear back in a few days, I might just go ahead and close out this issue.

Hi @ronikobrosly! I still owe you a proper test run! I was on call the last couple of days, so I only had time to pull the new version. I will give you a heads up by next weekend at the latest.
In the meantime - do you know of any method that would allow me to change the patient covariates during inference as well?

Ok sounds good @NiklasTR 👍 . That's the tricky part. When you use the GPS tool, after you .fit() your initial data, an object named gam_results is created within the GPS object. It contains the final generalized additive model used to predict points on the causal curve. This model is from the pygam package, so feel free to check out that package's API to see the full range of things you can do with the model. So for example, you can access the final model like so:

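from causal_curve.gps import GPS

# df is assumed to be a pandas DataFrame holding the treatment, covariates, and outcome
gps = GPS()
gps.fit(T=df['Treatment'], X=df[['X_1', 'X_2']], y=df['Outcome'])

# The fitted pygam model used to predict points on the causal curve
gps.gam_results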

In a Python terminal or Jupyter notebook, try dir(gps.gam_results) to see the pygam model methods available to you.

What makes this tricky is that this model prediction method has only two inputs:

  • The treatment value
  • The GPS value (a one-dimensional score, derived from the training covariates, that adjusts the treatment effect to control for confounding bias).
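
To make the second input a bit more concrete, here is a minimal sketch of a GPS under a normal treatment model, written with plain NumPy on synthetic data. It is not the package's internal code: the linear treatment model, the variable names, and the helper normal_gps are assumptions for illustration. The fitted means for the training rows are baked into the function, so it accepts a treatment value only.

```python
import numpy as np

# Synthetic training data (an assumption for illustration)
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))                                   # two covariates
T = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(size=n)        # treatment depends on X

# Treatment model: ordinary least squares of T on the covariates
design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(design, T, rcond=None)
mu_train = design @ beta                                      # fitted mean per training row
sigma = np.std(T - mu_train, ddof=design.shape[1])            # residual standard deviation


def normal_gps(t):
    """Normal density of treatment value(s) `t` around each training row's fitted mean.

    Hypothetical helper: `t` may be a scalar or an array with one entry per training row.
    """
    z = (t - mu_train) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))


gps_observed = normal_gps(T)      # GPS at each row's own observed treatment
gps_at_one = normal_gps(1.0)      # GPS of every training row at treatment value 1.0
```

Each training row gets one GPS value per treatment value, which is why the score is one-dimensional no matter how many covariates went into the treatment model.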

You might think, "big deal, I can just somehow plug in my new covariate values in the gps_function you have in your code, and get my new gps values," but unfortunately it isn't clear to me how to do this. There isn't a way to do it in this function, for instance:

def _create_normal_gps_function(self):

So I'd need to really sit back and rethink things to get that functionality.

For now, when you provide a new treatment value to predict on, the gps object looks at the GPS function that it learned from the original training data, and gives you the relevant GPS you need to predict the point on the causal curve. If that makes sense.
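
If it helps, the idea can be sketched end to end with plain NumPy. This is only a sketch under stated assumptions, not the code in causal-curve: the data are synthetic, the treatment model is linear with normal errors, and a quadratic least-squares outcome model stands in for the pygam GAM. The names gps_at, outcome_design, and predict_point are hypothetical.

```python
import numpy as np

# Synthetic data with one confounder (an assumption for illustration)
rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=n)
T = 0.5 * X + rng.normal(size=n)            # treatment depends on the confounder
y = 2.0 * T + X + rng.normal(size=n)        # outcome depends on treatment and confounder

# Step 1: treatment model T ~ X, fitted on the training data only
A = np.column_stack([np.ones(n), X])
coef_t, *_ = np.linalg.lstsq(A, T, rcond=None)
mu_train = A @ coef_t
sigma = np.std(T - mu_train, ddof=2)


def gps_at(t):
    """GPS of every training row at treatment value(s) `t` (normal density)."""
    z = (t - mu_train) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))


def outcome_design(t, r):
    """Quadratic design matrix in treatment `t` and GPS `r` (stand-in for a GAM)."""
    t = np.broadcast_to(t, r.shape)
    return np.column_stack([np.ones_like(r), t, t**2, r, r**2, t * r])


# Step 2: outcome model on (observed treatment, GPS at observed treatment)
coef_y, *_ = np.linalg.lstsq(outcome_design(T, gps_at(T)), y, rcond=None)


def predict_point(t):
    """Average the outcome model over the training rows' GPS values at treatment `t`."""
    r = gps_at(t)
    return float((outcome_design(t, r) @ coef_y).mean())


curve = {t: predict_point(t) for t in (0.0, 0.5, 1.0)}
```

Note that predict_point takes only a treatment value: the covariates enter solely through mu_train, which was fixed when the treatment model was fitted.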

Feel free to tinker with the code though, and if you figure out something let me know!

Closing this issue now that predict methods have been implemented

FYI @NiklasTR , I did a major update to the package in Jan and now the TMLE module works much better