choderalab/assaytools

Test out Gaussian Process for dealing with outliers

sonyahanson opened this issue · 8 comments

Thanks to Patrick and Bas for chatting about this over coffee after my lab meeting. Sounds like Lee had a very promising answer: Gaussian Processes!

Here are some potentially useful links I found by googling 'gaussian process outliers python':
https://bugra.github.io/work/notes/2014-05-11/robust-regression-and-outlier-detection-via-gaussian-processes/
https://ocefpaf.github.io/python4oceanographers/blog/2015/03/16/outlier_detection/

I don't think this is what we want. GPs are great when the collected data have a natural spatial relationship that must be learned. We are dealing with a very different case: we already know what the relationship is, through the dissociation constant equations and mass conservation laws. Using a GP of the sort in those examples would not only "forget" that information, it also wouldn't let us propagate any uncertainty about which points are outliers into the posterior.
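To make "we know what the relationship is" concrete, here is a minimal sketch of the simplest case, assuming one-to-one binding (P + L <-> PL) at equilibrium. The function name `complex_concentration` and its arguments are hypothetical and are not taken from the assaytools code; the real models may include more components.

```python
import numpy as np

def complex_concentration(p_total, l_total, kd):
    """Equilibrium complex concentration [PL] for P + L <-> PL.

    Assumes simple two-component binding. Combines the dissociation
    constant, Kd = [P][L] / [PL], with mass conservation,
    [P] = p_total - [PL] and [L] = l_total - [PL], and takes the
    physically meaningful root of the resulting quadratic.
    All inputs must share the same concentration units.
    """
    b = p_total + l_total + kd
    return 0.5 * (b - np.sqrt(b * b - 4.0 * p_total * l_total))

# Hypothetical titration: fixed protein, increasing ligand (arbitrary units).
ligand = np.array([0.1, 1.0, 10.0, 100.0])
bound = complex_concentration(1.0, ligand, kd=1.0)
```

A GP would have to learn this curve from the data, whereas here only Kd (and any nuisance parameters) needs to be inferred.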

Instead, I think we should use an approach like this, where there is a prior on the fraction of outliers and the outlier distribution has a mean and variance that is inferred (and marginalized out) during MCMC sampling:
http://www.astroml.org/book_figures/chapter8/fig_outlier_rejection.html
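A rough sketch of what that kind of likelihood could look like, assuming Gaussian measurement error and a single broad Gaussian for the outliers. All names here (`f_out`, `mu_out`, `sigma_out`, and the functions themselves) are made up for illustration and are not part of astroML or assaytools; in practice `f_out`, `mu_out`, and `sigma_out` would get priors and be sampled alongside the binding parameters.

```python
import numpy as np

def log_normal_pdf(x, mean, sigma):
    """Log density of a normal distribution, elementwise."""
    return -0.5 * np.log(2.0 * np.pi * sigma ** 2) - 0.5 * ((x - mean) / sigma) ** 2

def _component_logs(y_obs, y_model, sigma, f_out, mu_out, sigma_out):
    # Inlier: observation scattered around the physical model prediction.
    log_in = np.log1p(-f_out) + log_normal_pdf(y_obs, y_model, sigma)
    # Outlier: observation drawn from a broad distribution unrelated to the model.
    log_out = np.log(f_out) + log_normal_pdf(y_obs, mu_out, sigma_out)
    return log_in, log_out

def mixture_log_likelihood(y_obs, y_model, sigma, f_out, mu_out, sigma_out):
    """Log-likelihood with the per-point inlier/outlier labels summed out."""
    log_in, log_out = _component_logs(y_obs, y_model, sigma, f_out, mu_out, sigma_out)
    return np.sum(np.logaddexp(log_in, log_out))

def outlier_probability(y_obs, y_model, sigma, f_out, mu_out, sigma_out):
    """Per-point probability of being an outlier, given fixed parameter values."""
    log_in, log_out = _component_logs(y_obs, y_model, sigma, f_out, mu_out, sigma_out)
    return np.exp(log_out - np.logaddexp(log_in, log_out))

# Toy example (made-up numbers): the model predicts 0 everywhere,
# and the last observation is far off.
y_model = np.zeros(3)
y_obs = np.array([0.0, 0.1, 10.0])
p_out = outlier_probability(y_obs, y_model, sigma=0.5,
                            f_out=0.1, mu_out=0.0, sigma_out=10.0)
```

Because the labels are summed out inside the likelihood, no point is ever hard-rejected; averaging `outlier_probability` over MCMC samples would carry the uncertainty about which points are outliers through to the posterior.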

But first, before we even talk about models, we absolutely need to collect some examples of the outliers and look at them to see what it tells us about the nature of the data.

Just making a note here that this is something we should keep at the front of our minds.

Agreed! Would be great to compile a list of data with outliers to find a strategy that works!

Awesome! This is exactly what we need to make this work! Thanks!

@jchodera has an idea about Bayesian outlier detection that he is interested in implementing.