Where are predictions re-scaled during training?
You mention in the paper that all but a couple of the datasets were log-transformed prior to training (a reasonable thing to do), but I don't see where in the repo the predictions are transformed back to their original values for reporting MAE or RMSE (which are, of course, scale-dependent). Could you point out where this happens in the code, please? Thanks!
Thank you for your interest. We have carefully considered how to best report the performance of our models. We agree that in some fields it is standard practice to convert RMSE and MAE values back to their original units for reporting, to make them more straightforward to interpret. However, it is very common in the ADMET field to report MAE and RMSE on logarithmic values (e.g., pIC50). For example, several of the original publications of our benchmarking datasets also computed RMSE or MAE values for machine learning models trained on these datasets and reported them in terms of the log transform (e.g., Zheng, S. (2020). JCIM, 60(6), 3231-3245; Wang (2016). JCIM, 56(4), 763-773). This choice is based on two considerations:
First, many tasks in ADMET span multiple orders of magnitude, so calculating RMSE/MAE on the original values gives undue emphasis to large values, which are often less relevant. For example, a difference between a predicted/measured LD50 of 10 uM and 9 uM is practically irrelevant but contributes an absolute error of 1000 nM, while a difference in LD50 between 1 nM and 100 nM might be very important but contributes only 99 nM. Calculating these errors on a logarithmic scale instead yields much more useful error estimates that are unified across scales. This is similar to the mean squared log error (MSLE) that is commonly reported when “outliers” plague the data.
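A minimal sketch of this point (not code from the repository; the values are just the hypothetical LD50 pairs from the example above):

```python
import numpy as np

# Hypothetical LD50 values in nM from the example above:
# pair 1: measured 10 uM vs predicted 9 uM (practically equivalent)
# pair 2: measured 100 nM vs predicted 1 nM (large practical difference)
measured  = np.array([10_000.0, 100.0])   # nM
predicted = np.array([ 9_000.0,   1.0])   # nM

# On the original scale, the error is dominated by the large, less relevant pair
abs_err_raw = np.abs(measured - predicted)                       # [1000., 99.]

# On the log10 scale, the second pair correctly stands out as the bigger error
abs_err_log = np.abs(np.log10(measured) - np.log10(predicted))   # [~0.046, 2.0]

print("raw-scale error terms:", abs_err_raw)
print("log-scale error terms:", abs_err_log)
```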
Second, given that several of our ground truth datasets have been derived by the original authors from sigmoidal curves fitted on log-scaled experimental concentrations, it is mathematically invalid to calculate mean values and standard errors on the original units instead of the logarithm. See for example:
https://www.graphpad.com/guides/prism/latest/curve-fitting/reg_why_prism_fits_the_logec50_rat.htm
https://www.graphpad.com/support/faq/how-should-i-compute-an-average-ec50-for-several-experiments-how-can-i-express-its-uncertainty-or-standard-deviation/
Additionally, the logarithm is a monotonic function that preserves order, which means conclusions drawn from relative model comparisons are not impacted by calculating MSE/RMSE on the logarithmic scale.
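For illustration only (again not code from the repository), the order-preserving property of the log transform can be checked directly on any positive values:

```python
import numpy as np

# Hypothetical positive property values (e.g., IC50s in nM)
values = np.array([250.0, 3.0, 12_000.0, 45.0, 980.0])

# The log transform is strictly increasing, so the sorting order is unchanged
assert np.array_equal(np.argsort(values), np.argsort(np.log10(values)))
```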
Thanks for the thorough reply!
I broadly agree that, especially when the reference dataset is already provided in log space by subject-matter experts, it makes sense to train and report accuracy in log space, particularly given its mathematical validity (thanks for those links!). I'll note, though, that in the fubrain study (Esaki et al.), the RMSE and R^2 values are reported using the actual fractions, not the log fractions. See this link: https://pubs.acs.org/doi/full/10.1021/acs.jcim.9b00180#:~:text=Performance%20Evaluation to their Performance Evaluation section, which shows that the formulas they used for R^2 and RMSE operate on the un-logged values. They do look at the log space in Table 1 (https://pubs.acs.org/doi/full/10.1021/acs.jcim.9b00180#:~:text=build%20predictive%20models.-,Table%201.,-Summary%20of%20the), but it seems only to verify that the training and testing sets they established were not statistically different. I may be wrong here, I'm new to this, so please let me know if I am misinterpreting!
Agreed as well on the last point about inter-model comparisons not being affected.
Thanks for noting that; Esaki et al. did use values bounded between 0 and 1 for their analysis. That range does not span multiple orders of magnitude and thereby avoids the issue of placing particular emphasis on large values, which can occur with other datasets that were used.
Just so I'm sure: the reported accuracy of DeepDelta on the fubrain dataset (RMSE 0.830 ± 0.023) is on the log-transformed fraction, whereas in the original fubrain paper the accuracy (RMSE 0.44) is on the raw fraction? If so, how do DeepDelta (and RF, Chemprop, etc.) compare to the original Esaki et al. model?
Our reported values are on the log-transformed difference in values between molecule pairs, rather than the (log-transformed) absolute value of a single molecule. This is because we generated a new machine learning task in which every datapoint is composed of a pair of molecules and the target variable is the difference in their properties. Our goal is not to evaluate our models’ ability to predict absolute values, but instead to predict property differences between molecules. As such, our results are not directly comparable to the Esaki et al. model, because we are evaluating on a different machine learning task.
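As a rough sketch of that task construction (variable and column names here are illustrative, not the repository's actual code): each training example pairs two molecules, and the regression target is the difference of their log-scaled property values.

```python
import pandas as pd

def make_pairwise_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Cross-pair every molecule with every other molecule; the target
    becomes the difference in (log-scaled) property values.

    Assumes `df` has columns 'smiles' and 'log_value' (illustrative names).
    """
    pairs = df.merge(df, how="cross", suffixes=("_1", "_2"))
    pairs["delta"] = pairs["log_value_2"] - pairs["log_value_1"]
    return pairs[["smiles_1", "smiles_2", "delta"]]

# RMSE/MAE are then computed on predicted vs. true 'delta', i.e. on pairwise
# differences rather than on absolute single-molecule values.
```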
Great, thanks again for the thorough explanations!