ying531/MCMC-SymReg

Questions regarding using the MCMC samples

Opened this issue · 6 comments

Hi,

First, I really want to thank the authors, who have been pioneering this direction (Bayesian methods in SR), for providing the code here. As someone who has read the paper and played with the code locally, I found several places somewhat confusing (and could not find an explanation in the paper).

As an MCMC-based algorithm (treating each valid expression tree as a sample from the unknown posterior over expressions), I think the goal is to utilize all (or most of) the samples we get (in some cases discarding the initial ones as burn-in). But in the predict function, it seems that the default (and only implemented) approach is to use the K trees (and their OLS parameters beta) from the last iteration of the sampling process. This looks unintuitive to me because this algorithm is doing sampling across iterations rather than optimization; the samples from the last iteration therefore shouldn't have any advantage over, say, those from the second-to-last (or even the first) iteration. Please feel free to correct me if I've misunderstood anything.

Then, if we decide to use more samples than just the last ones, I wonder how to "combine" them. In most MCMC problems, the samples serve as Monte Carlo approximations for posterior predictive tasks. Here, the vanilla approach would be to evaluate the test data on all the samples we get and average those evaluations. But that makes our predictive model very unintuitive and complicated. Were other ways of using the samples considered when writing the paper?
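For concreteness, the "vanilla" posterior-predictive averaging described above could look like the following sketch. Note this is not the BSR API: `posterior_predictive_mean` and the callable-per-sample representation are hypothetical stand-ins for however the fitted expression trees (with their OLS coefficients) are actually evaluated.

```python
import numpy as np

def posterior_predictive_mean(sampled_models, X, burn_in=0):
    """Average predictions over retained MCMC samples.

    `sampled_models` is assumed to be a list of callables, one per
    MCMC iteration, each mapping X -> predictions (e.g. a sampled
    expression tree evaluated with its fitted OLS coefficients).
    The first `burn_in` samples are discarded.
    """
    kept = sampled_models[burn_in:]
    preds = np.stack([model(X) for model in kept])  # (n_samples, n_points)
    return preds.mean(axis=0)

# Toy usage: two "sampled expressions" for z = x + y
X = np.array([[1.0, 2.0], [3.0, 4.0]])
samples = [
    lambda X: X[:, 0] + X[:, 1],        # the exact expression
    lambda X: 1.1 * X[:, 0] + X[:, 1],  # a slightly different sample
]
print(posterior_predictive_mean(samples, X))  # -> [3.05 7.15]
```

The averaged predictor is no longer a single symbolic expression, which is exactly the interpretability cost raised above.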

(P.S. Regarding my first point, I ran a very toy experiment using BSR to predict z = y + x, and observed that train_err_ inside the regressor fluctuates considerably (instead of showing the decreasing pattern typical of optimization problems); so I guess there is no real "learning" as iterations increase: we simply obtain more and more samples so that our posterior approximation becomes more accurate.)

(I also believe the questions raised by @cobac in the earlier issue #2 are somewhat related to this, so pardon me for @-ing you as well.)

Once again I really appreciate your amazing work and look forward to your response.

cobac commented

Then, if we decide to use more samples than simply the last one, I wonder what's the way to "combine" them

That's IMO one of the key points to figure out as far as Bayesian SR goes. Posterior distributions over discrete spaces with a high number of elements are always a mess.

Paraphrasing from the discussion in the report I wrote about Bayesian SR (linked in one of the comments on #2):

... how to select output expressions from the MCMC chains. Jin et al. report the last expression from each chain after a prespecified number of iterations. However, we have observed that the last expression need not be a particularly good one. We could instead choose the equation with the highest fit in a chain, but this approach would be biased towards more complex expressions. In this case, if the sampler had discovered the correct expression, it is guaranteed to have the best fit since the datasets were generated without noise. But in real use cases, or in simulations with noisy data, this approach would worsen the overfitting issues. Alternatively, to restrain overfitting we could report the Pareto frontier between fit and complexity, as Eureqa or PySR/SymbolicRegression.jl do. Nevertheless, ideally we would use an approach that takes advantage of the Bayesian components of the model and generates confidence distributions. For predictive purposes this is easy, since we can simply generate predictions from all the samples of a chain, possibly after a warm-up period. But it is not clear how to generate confidence distributions when our objective is to discover interpretable expressions. If there is a set of tree structures that the MCMC chains explore frequently, we can use their relative frequencies as confidence distributions. However, we would then have to take into consideration the distribution of their linear-operator node coefficients. One avenue for exploring this approach is to split the sampling algorithm into two separate Gibbs steps: the first step samples a tree structure, whereas the second step samples a new set of linear operators.
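The relative-frequency idea in that excerpt can be sketched in a few lines. This is only an illustration under the assumption that each sampled tree can be reduced to a hashable structure key (e.g. a canonical string form of the expression, ignoring coefficient values); the real BSR tree objects would need such a canonicalization, and the coefficient distributions per structure would still have to be handled separately, as noted above.

```python
from collections import Counter

def structure_confidence(sampled_structures, burn_in=0):
    """Estimate a 'confidence distribution' over tree structures from
    their relative visit frequencies in an MCMC chain.

    `sampled_structures` is assumed to be a list of hashable structure
    keys, one per retained iteration (coefficients ignored).
    """
    kept = sampled_structures[burn_in:]
    counts = Counter(kept)
    total = len(kept)
    return {s: c / total for s, c in counts.items()}

# Toy chain that visits the structure "x+y" three times out of four
chain = ["x+y", "x*y", "x+y", "x+y"]
print(structure_confidence(chain))  # -> {'x+y': 0.75, 'x*y': 0.25}
```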

Best of luck with your projects! 😊

Thanks, @cobac. What you posted is truly helpful, and I even regret not taking a closer look at your report before raising my questions 😄. It seems we share the same concerns and feedback about the vanilla implementation, which, despite the amazing ideas and formulations behind it, doesn't give a convincing explanation of how the samples are utilized.

I guess the most confusing point right now is the training-error-versus-iteration curve in the paper. With the current implementation of using the last expression, it seems there is no reason to expect the training error to converge.

Hello @listar2000 and @cobac, thank you for the great discussion! Excuse me for not seeing this earlier. Yes, the current implementation uses the trees from the last iteration because that worked well in our experiments, and the algorithm sometimes seemed to concentrate around a good solution. However, it is definitely true that a rigorous use of MCMC should utilize all (post-burn-in) samples to approximate the posterior. Personally, I am also curious whether either of the two approaches outperforms the other.

Hi @ying531, thanks for the response. I do understand the MCMC side of this algorithm, but for a symbolic regression problem I believe the goal is still to "cherry-pick" the best expression(s) explored so far. In my opinion, simply keeping a running-best expression tree over the iterations should perform better than using the last iteration (I implemented this for my own project and it works). I could also send a PR with some simple adjustments if that seems reasonable to you.
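The running-best idea is simple enough to sketch. The `RunningBest` class and the callable-model interface below are hypothetical (not the BSR API): in practice the update would be called once per iteration with the freshly sampled tree and its training data.

```python
import numpy as np

class RunningBest:
    """Track the best-fitting expression seen across MCMC iterations,
    using training MSE as the selection criterion."""

    def __init__(self):
        self.best_model = None
        self.best_err = np.inf

    def update(self, model, X, y):
        """Evaluate `model` (a callable X -> y_hat standing in for a
        sampled expression tree) and keep it if it is the best so far."""
        err = np.mean((model(X) - y) ** 2)
        if err < self.best_err:
            self.best_err = err
            self.best_model = model
        return self.best_err

# Toy usage on z = x + y
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X[:, 0] + X[:, 1]
tracker = RunningBest()
for model in [lambda X: X[:, 0],            # a partial expression
              lambda X: X[:, 0] + X[:, 1],  # the exact expression
              lambda X: X[:, 0] * X[:, 1]]: # a later, worse sample
    tracker.update(model, X, y)
print(tracker.best_err)  # -> 0.0 (the exact expression is retained)
```

As noted in the reply below this comment, selecting purely on training fit tends to favor more complex expressions, so this is best paired with some complexity control.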

Hi @listar2000 thanks so much for your great proposal! That sounds great. Please feel free to send it. Thanks a lot.

Simply keeping a running-best expression tree over the iterations should give a better performance than using the last iteration in my opinion

This will most likely result in a bias towards more complex expressions. To explore this path, I'd suggest keeping track of the best-performing tree of each possible size (as SymbolicRegression.jl does, or the old Eureqa did), and then reporting the Pareto frontier between the accuracy metric and complexity.
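A minimal sketch of that suggestion, assuming each sampled expression can be summarized as a (complexity, error, model) triple; the function name and representation are illustrative, not from the BSR codebase:

```python
def pareto_frontier(candidates):
    """From (complexity, error, model) triples, keep the best model at
    each complexity, then retain only those whose error strictly
    improves on every simpler retained model: the accuracy/complexity
    Pareto frontier."""
    best_at_size = {}
    for size, err, model in candidates:
        if size not in best_at_size or err < best_at_size[size][0]:
            best_at_size[size] = (err, model)
    frontier, best_err = [], float("inf")
    for size in sorted(best_at_size):
        err, model = best_at_size[size]
        if err < best_err:
            frontier.append((size, err, model))
            best_err = err
    return frontier

# Toy usage with expressions as strings
candidates = [
    (3, 0.5, "x"), (3, 0.4, "y"),     # size 3: keep "y"
    (5, 0.4, "x+1"),                  # no better than size 3: dropped
    (7, 0.1, "x+y"),                  # strictly better: kept
]
print(pareto_frontier(candidates))  # -> [(3, 0.4, 'y'), (7, 0.1, 'x+y')]
```

Reporting the whole frontier rather than a single tree lets the user pick the accuracy/complexity trade-off themselves, which is the behavior described above for SymbolicRegression.jl and Eureqa.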