ying531/MCMC-SymReg

Questions regarding acceptance ratio calculation

Closed this issue · 8 comments

cobac commented

Hey,

I've been studying your paper & code (very interesting!) and there are a few points I don't think I'm understanding properly. Hopefully you can clarify the following points for me 😊 .

  1. I don't understand why the prior probabilities of the linear coefficients P(\Theta^{*})/P(\Theta^{(t)}) are not included in the acceptance rate calculation in the standard MH step (Expression 1).
    • I understand not including the probabilities of transitioning from one set of coefficients to the other since they are both drawn from p(\Theta|T), but why not include the probability of the proposed/previous parameters?
  2. Along the same lines as the previous question, shouldn't the prior probabilities of the variances p(\Sigma^{*})/p(\Sigma^{(t)}) also be included in the standard MH case?
    • I believe your code includes the prior on the variance of the residuals, p(\sigma), although it is not included in the formula in the paper (Expression 1); see the sketch after this list. The relevant line is:
      logR = logR + np.log(invgamma.pdf(new_sigma, sig)) - np.log(invgamma.pdf(sigma, sig))
  3. And finally, I believe that there are two reversible jump edge cases: when expanding from 0 to some linear transformations, and when shrinking to 0 linear transformations.
    • Under those circumstances I think your code is still including the prior probabilities of the variances \sigma_\Theta, even though they are not used to draw any parameters. In any case, this is most likely irrelevant in all scenarios.
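For concreteness, this is roughly the standard MH step I had in mind, with both prior ratios made explicit. It is only a sketch: loglike_*, log_q_*, and log_prior_theta_* are hypothetical names of mine, not from funcs.py.

from scipy.stats import invgamma

def log_accept_ratio(loglike_new, loglike_old,
                     log_q_old_given_new, log_q_new_given_old,
                     new_sigma, sigma, sig,
                     log_prior_theta_new, log_prior_theta_old):
    # data-likelihood ratio and tree-proposal ratio, as in Expression 1
    logR = loglike_new - loglike_old
    logR += log_q_old_given_new - log_q_new_given_old
    # prior ratio of the residual variance, p(\sigma^{*})/p(\sigma^{(t)}),
    # matching the invgamma term already in funcs.py
    logR += invgamma.logpdf(new_sigma, sig) - invgamma.logpdf(sigma, sig)
    # prior ratio of the linear coefficients, P(\Theta^{*})/P(\Theta^{(t)}),
    # which is the term my questions are about
    logR += log_prior_theta_new - log_prior_theta_old
    return logR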

Thank you.

PS: I believe there is a typo on p.5 at the end of the second paragraph of the "Shrinkage" section. It should read "Then we discard U^{*} and obtain \Theta^{*}, \sigma^{*}_\Theta".

Hi,

Thank you for your interest in our work! Per your questions,

  1. We didn't include the linear coefficients in our Bayesian framework - the idea is from generalized additive models. We treat the trees as probabilistic/Bayesian, and directly fit an OLS on the outputs of the trees (a toy sketch follows this list).
  2. Yes the variances should be included in the formula. Thanks for pointing it out!
  3. Do you mean jumping from trees with no nodes to trees with nodes? In that case we also need a reversible jump. But I think in practice we can also force the trees to always have nodes, for example, by not eliminating the root node.
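To illustrate point 1: the trees play the role of feature maps, and the outer linear coefficients are fit by plain OLS rather than sampled. A toy sketch (the tree outputs below are made-up stand-ins, not draws from our model):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
# stand-ins for the outputs of two sampled trees evaluated on the data
tree_outputs = np.column_stack([np.sin(X[:, 0]), X[:, 1] ** 2])
y = 2.0 * np.sin(X[:, 0]) - 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=100)
# OLS fit of the outer linear coefficients (intercept + one per tree)
design = np.column_stack([np.ones(len(y)), tree_outputs])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)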

And thank you so much for pointing out the typo! We'll work on the revision. :)

cobac commented

Thank you for your reply!

  1. We didn't include the linear coefficients in our Bayesian framework - the idea is from generalized additive models. We treat the trees as probabilistic/Bayesian, and directly fit an OLS on the outputs of the trees.

With my question I was referring to the linear coefficients of the linear transformation nodes (i.e. \Theta). Were you referring to them or to the tree linear coefficients (\beta)? In the case with no reversible jump you still update the \Theta set of a tree, so my question was why not include their prior probabilities P(\Theta^{*})/P(\Theta^{(t)}) in Expression 1.

  2. Yes the variances should be included in the formula. Thanks for pointing it out!

Including the variances of the linear coefficients of the linear transformation nodes (\sigma_a, \sigma_b) \in \sigma_\Theta? Currently in your code I think you are only including the ratio of P(\sigma^2) but not P(\sigma_\Theta).

  3. Do you mean jumping from trees with no nodes to trees with nodes? In that case we also need a reversible jump. But I think in practice we can also force the trees to always have nodes, for example, by not eliminating the root node.

I meant jumping to/from trees that (don't) have linear transformation nodes but have other operator nodes. I think in those cases the contribution of UaList and UbList in auxProp() will be 0, but the contributions from new_sa2 and new_sb2 will still be added, although those variances are not used to sample any coefficient. But in any case this is definitely not going to affect the overall performance of the algorithm.

MCMC-SymReg/codes/funcs.py

Lines 991 to 1005 in 813ad00

# contribution of new_sa2 and new_sb2
logh += np.log(invgamma.pdf(new_sa2, 1))
logh += np.log(invgamma.pdf(new_sb2, 1))
loghstar += np.log(invgamma.pdf(old_sa2, 1))
loghstar += np.log(invgamma.pdf(old_sb2, 1))
for i in np.arange(0, len(UaList)):
    # contribution of UaList and UbList
    logh += np.log(norm.pdf(UaList[i], loc=0, scale=np.sqrt(new_sa2)))
    logh += np.log(norm.pdf(UbList[i], loc=0, scale=np.sqrt(new_sb2)))
for i in np.arange(0, len(NUaList)):
    # contribution of NUaList and NUbList
    loghstar += np.log(norm.pdf(NUaList[i], loc=0, scale=np.sqrt(old_sa2)))
    loghstar += np.log(norm.pdf(NUbList[i], loc=0, scale=np.sqrt(old_sb2)))

MCMC-SymReg/codes/funcs.py

Lines 1072 to 1090 in 813ad00

# contribution of sigma_ab
logh += np.log(invgamma.pdf(new_sa2, 1))
logh += np.log(invgamma.pdf(new_sb2, 1))
loghstar += np.log(invgamma.pdf(old_sa2, 1))
loghstar += np.log(invgamma.pdf(old_sb2, 1))
# contribution of u_a, u_b
for i in np.arange(len(last_a), nn):
    logh += norm.pdf(NaList[i], loc=1, scale=np.sqrt(new_sa2))
    logh += norm.pdf(NbList[i], loc=0, scale=np.sqrt(new_sb2))
# contribution of U_theta
for i in np.arange(0, len(UaList)):
    logh += np.log(norm.pdf(UaList[i], loc=0, scale=np.sqrt(new_sa2)))
    logh += np.log(norm.pdf(UbList[i], loc=0, scale=np.sqrt(new_sb2)))
for i in np.arange(0, len(NUaList)):
    loghstar += np.log(norm.pdf(NUaList[i], loc=0, scale=np.sqrt(old_sa2)))
    loghstar += np.log(norm.pdf(NUbList[i], loc=0, scale=np.sqrt(old_sb2)))

Thank you again 😊

Hi,

  1. Yes we should include them in the equations. Thanks for pointing it out! :)
  2. Sorry for the misunderstanding - \sigma_\Theta should also be included. The prior is inverse gamma, and I think they are included in the first code block you cite?
  3. I agree that this is a bit tricky: the sigmas are not used in the likelihood of the data, but they are still present in the joint likelihood of the data and parameters (spelled out below this list). In the code they are included because we compute the overall likelihood.
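Spelled out, in the paper's notation as I understand it, the joint factorizes as

p(y, T, \Theta, \sigma_\Theta, \sigma^2) = p(y | T, \Theta, \sigma^2) p(\Theta | T, \sigma_\Theta) p(\sigma_\Theta) p(T) p(\sigma^2),

so p(\sigma_\Theta) never enters the data likelihood p(y | T, \Theta, \sigma^2), but it does appear in the joint.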

Thank you again for pointing out typos in the paper - we will work on them. :)

cobac commented
  1. Yes we should include them in the equations. Thanks for pointing it out! :)

But they are not included in the code either, so I'm not sure. You select fStruc(...)[0], which is loglike, but exclude loglike_para; see the sketch after the snippet below.

MCMC-SymReg/codes/funcs.py

Lines 1282 to 1283 in 813ad00

strucl = fStruc(Root, n_feature, Ops, Op_weights, Op_type, beta, new_sa2, new_sb2)[0]
struclstar = fStruc(oldRoot, n_feature, Ops, Op_weights, Op_type, beta, sigma_a, sigma_b)[0]
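What I would have expected is something like the sketch below. Note I'm assuming fStruc also returns loglike_para as its second element, which I'm inferring from the surrounding code rather than certain of:

# unpack both the structure log-likelihood and the parameter log-likelihood
strucl, strucl_para = fStruc(Root, n_feature, Ops, Op_weights, Op_type,
                             beta, new_sa2, new_sb2)[0:2]
struclstar, struclstar_para = fStruc(oldRoot, n_feature, Ops, Op_weights, Op_type,
                                     beta, sigma_a, sigma_b)[0:2]
# include P(\Theta^{*})/P(\Theta^{(t)}) alongside the structure ratio
log_ratio = (strucl + strucl_para) - (struclstar + struclstar_para)  # plus the remaining terms of logR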

  2. Sorry for the misunderstanding - \sigma_\Theta should also be included. The prior is inverse gamma, and I think they are included in the first code block you cite?

Yes, they are in the cases with reversible jumps (in hratio), but I don't think they are included in the case with no reversible jump.

Currently the only disparity between the paper and the code (I think) is the ratio of the prior probabilities of the residual variance \sigma^2. Since you exclude both the probability of the parameters \Theta and the probability of the variance of the parameters \sigma_\Theta when there is no reversible jump, and neither is included in Expression 1, I assumed this was the intended design of the algorithm. That's where my questions come from, since I would have assumed that they must not be excluded even in the case with no reversible jump.

Maybe I'm wrong. Thank you so much for being so responsive, and I'm happy I could at least help out with the typos :)

I got your point! Yes thanks for pointing it out, I'll revise the codes. :)

Hi @cobac! I was giving the code a careful review and would like to clarify a little bit more.

We should include the likelihood of the linear node parameters. But since they are all drawn independently from the prior given (sigma_a, sigma_b), which in turn have inverse-gamma priors, we have f(new para) * q(old para | new para) = f(old para) * q(new para | old para), so these terms cancel in the ordinary MH acceptance ratio and do not need to be computed.
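A quick numerical check of this cancellation (a toy sketch with a normal prior as an example, not the repo's code):

from scipy.stats import norm

# If theta* is proposed as an independent draw from its prior, then
# q(theta* | theta) = p(theta*), so
#   log p(theta*) + log q(theta | theta*) - log p(theta) - log q(theta* | theta) = 0
# and the prior terms drop out of the MH acceptance ratio.
sigma_theta = 1.3                                  # example prior scale
theta_old = 0.4
theta_new = norm.rvs(loc=0, scale=sigma_theta, random_state=0)

log_prior_ratio = norm.logpdf(theta_new, 0, sigma_theta) - norm.logpdf(theta_old, 0, sigma_theta)
log_proposal_ratio = norm.logpdf(theta_old, 0, sigma_theta) - norm.logpdf(theta_new, 0, sigma_theta)
print(log_prior_ratio + log_proposal_ratio)        # 0.0 up to floating point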

cobac commented

Hey @ying531!

I'm not confident in my opinion since it's been a while since I thought about this topic, but I think you are correct! I had overlooked that q(S*|S) also includes the transition of the linear operators. But in that case, shouldn't the ratio of the prior probabilities of the variances still be included in the ratio calculation?

Btw, I did a research internship last year building on your work on BSR (the report is available here and my implementation of the algorithm is here, although I made the mistake of including the redundant likelihood of the linear parameters).

I also noticed that you don't save all samples in the MCMC chains, but only the accepted jumps. I believe that for the MCMC chains to asymptotically converge to the posterior distribution, we have to append a copy of the old sample to the chain whenever we reject a jump. I think the performance difference I report might be due to this rather than being a real improvement. However, my personal conclusion after working for a few months on BSR is that there is no good way of getting a Bayesian output from the chains (i.e. confidence/uncertainty distributions over the discovered expressions). I elaborate a bit on this in the first paragraph of page 25 of my report. If you are only interested in the last/best expression from the chain this is not an issue, but I think it can be problematic to just select the last/best expression.
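For reference, this is the bookkeeping I mean; a minimal sketch where propose and log_accept_ratio are made-up stand-ins, not your API:

import numpy as np

rng = np.random.default_rng(0)

def propose(state):
    # stand-in for the tree/parameter proposal
    return state + rng.normal()

def log_accept_ratio(new, old):
    # stand-in for the MH log acceptance ratio (standard normal target)
    return -0.5 * (new ** 2 - old ** 2)

current, samples = 0.0, []
for _ in range(1000):
    candidate = propose(current)
    if np.log(rng.uniform()) < log_accept_ratio(candidate, current):
        current = candidate          # accept the jump
    samples.append(current)          # on rejection, repeat the old sample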

I also remember noticing some other non-technical details about your article (e.g.

  1. I don't think you report which hyperparameters you used during your simulations;
  2. that your repository is only linked on arXiv but not in the paper (I only found that you had shared your code months after finding your preprint);
  3. or that the solution of the Jacobian is not in the text).

If you are interested I'd be happy to give more feedback on your paper if you ping me once you release the new version 😄.

Best of luck with the project!

Hi @cobac! I still think that the prior does not need to be computed, due to symmetry. I might be wrong.

Thank you so much for sharing your work! It is a great honor to know that my work inspired your exploration. I read your report - it is very well written, detailed, and rigorous, with extensive experiments. I agree that, unlike scalar outputs, tree structures are hard to summarize into a Bayesian output. Our method, as a first exploration in this direction, is far from perfect. Thanks for your fantastic work! It also helped my understanding. I wish you the best of luck with your future endeavors!