MOSES node type distribution doesn't sum to one

Question

asiraudin opened this issue a year ago · 1 comment

Hello Clément,

When processing the MOSES dataset with MiDi's compute_all_statistics code, I noticed that the hard-coded node_type values look incorrect. Here is what I get for the train dataset:
[0.72200687 0.1364436 0.10383305 0.01433876 0.01637907 0.00546271 0.00153594 0.0]
while the hard-coded values in the Moses dataset are
[0.722338, 0.13661, 0.103549, 0.1421803, 0.163655, 0.005411, 0.00150, 0.0]

On which split did you compute the marginal distributions? The fourth and fifth values differ by an order of magnitude (0.1421803 vs. 0.01433876 and 0.163655 vs. 0.01637907), and the hard-coded distribution doesn't sum to 1.

Best,
Antoine
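
For anyone wanting to reproduce the check described above, here is a minimal sketch in plain Python. The two lists are copied from the issue text; the hard-coded list is taken as the eight numeric entries quoted there. The 0.5 to 2.0 ratio window used to flag entries is an arbitrary threshold chosen for illustration, not something from MiDi.

```python
# Values copied from the issue text; nothing here calls MiDi itself.
computed = [0.72200687, 0.1364436, 0.10383305, 0.01433876,
            0.01637907, 0.00546271, 0.00153594, 0.0]
hard_coded = [0.722338, 0.13661, 0.103549, 0.1421803,
              0.163655, 0.005411, 0.00150, 0.0]

# A valid categorical distribution should sum to 1 (up to rounding).
computed_sum = sum(computed)
hard_coded_sum = sum(hard_coded)
print(f"computed sum:   {computed_sum:.6f}")
print(f"hard-coded sum: {hard_coded_sum:.6f}")

# Flag entries whose ratio to the recomputed value is far from 1.
# The (0.5, 2.0) window is an illustrative assumption.
suspicious = [
    i for i, (c, h) in enumerate(zip(computed, hard_coded))
    if c > 0 and not 0.5 < h / c < 2.0
]
print("suspicious indices (0-based):", suspicious)
```

With these numbers, the recomputed list sums to 1 while the hard-coded one sums to roughly 1.275, and only the entries at 0-based indices 3 and 4 are flagged.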

Answer 1 · 2023-11-22T15:35:15.000Z

Hello Antoine, these values were computed over the full dataset. The reason is that if an atom type appears in the test set but not in the training set, it results in an NLL of +infinity, because the probability of generating it is 0. The same applies to the distribution of the number of nodes.
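
The infinite-NLL argument can be illustrated with a small sketch. The function `categorical_nll` and the toy three-class marginals below are hypothetical, made up for illustration; they are not MiDi code or MOSES statistics.

```python
import math


def categorical_nll(probs, observed):
    """Average negative log-likelihood of observed class indices under probs."""
    total = 0.0
    for k in observed:
        p = probs[k]
        # -log(0) is +infinity; math.log(0.0) would raise, so handle it explicitly.
        total += math.inf if p == 0.0 else -math.log(p)
    return total / len(observed)


# Hypothetical toy marginals (assumed values, not from MOSES).
train_only = [0.7, 0.3, 0.0]    # class 2 never appears in the training split
full_data = [0.69, 0.30, 0.01]  # class 2 appears once the test split is included

# The test split contains one atom of class 2.
test_atoms = [0, 0, 1, 2]

nll_train = categorical_nll(train_only, test_atoms)
nll_full = categorical_nll(full_data, test_atoms)
print(nll_train)  # infinite: class 2 has probability 0
print(nll_full)   # finite
```

A single test atom of an unseen type is enough to make the average NLL infinite, which is why marginals estimated on the full dataset avoid the problem.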