Mismatching evaluation code for FedKiTS19
akash-07 opened this issue · 4 comments
Dear authors,
The `evaluate_dice_on_tests` function calls

```python
y_pred = model(X).detach().cpu()
preds_softmax = softmax_helper(y_pred)
preds = preds_softmax.argmax(1)
y = y.detach().cpu()
dice_score = metric(preds, y)
```

and passes `preds` to the `metric` function.
However, the general-purpose `evaluate_model_on_tests` available in `flamby.utils` passes `y_pred` to the metric directly, without the softmax/argmax step. This mismatch causes different metric values for Fed_KiTS19 evaluation depending on which function is used.
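To make the difference concrete, here is a minimal, self-contained illustration; `dice_per_class` is a hypothetical stand-in for the actual Fed-KiTS19 metric, and `torch.softmax` is used in place of `softmax_helper`:

```python
# Minimal illustration (not FLamby code): the same label-based Dice metric gives
# different values depending on whether it receives hard labels or raw logits.
import torch

def dice_per_class(pred_labels, target, num_classes=3, eps=1e-8):
    # Hypothetical stand-in for the KiTS19 Dice metric: expects integer class maps.
    scores = []
    for c in range(1, num_classes):  # skip the background class
        p = (pred_labels == c).float()
        t = (target == c).float()
        scores.append((2 * (p * t).sum() + eps) / (p.sum() + t.sum() + eps))
    return torch.stack(scores).mean()

torch.manual_seed(0)
logits = torch.randn(1, 3, 8, 8, 8)      # fake network output of shape (B, C, D, H, W)
y = torch.randint(0, 3, (1, 8, 8, 8))    # fake integer ground-truth segmentation

# evaluate_dice_on_tests path: softmax + argmax first, then the metric.
preds = torch.softmax(logits, dim=1).argmax(1)
print("dice on hard labels:", dice_per_class(preds, y).item())

# evaluate_model_on_tests path: the raw logits reach the metric unchanged, so the
# `== c` comparisons on float logits essentially never hold and the score collapses.
print("dice on raw logits :", dice_per_class(logits, y).item())
```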
It seems like `evaluate_dice_on_tests` is the correct version. Can you please confirm?

Thanks!
Hello @akash-07!
For models working on data modalities that are too big to fit in RAM, we have functions that batch the inference, such as `evaluate_dice_on_tests`, and measure the prediction/ground-truth match at the sample level; this is also the case for Fed-LIDC. These are the functions used in the benchmark script.
I agree that it's not really clear. The metric functions also "work", but they operate patch-wise.
Maybe @ErumMushtaq can provide more info?
So, long story short, `evaluate_dice_on_tests` is the "true" function to use to replicate the benchmark numbers in the article; see lines 589 to 610 (called with a batch size of 2) here: https://github.com/owkin/FLamby/blob/main/flamby/benchmarks/benchmark_utils.py#:~:text=elif%20dataset_name%20%3D%3D%20%22fed_kits19,compute_ensemble_perf%20%3D%20False
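For reference, a call along these lines should reproduce that setting. This is only a sketch: the import locations, the `FedKits19`/`Baseline`/`NUM_CLIENTS`/`metric` names, and the exact `evaluate_dice_on_tests` signature are assumptions to be checked against the benchmark script linked above.

```python
# Rough sketch only: import paths, names, and the function signature below are
# assumptions; check the linked benchmark script for the exact call.
import torch
from flamby.datasets.fed_kits19 import (  # assumed import location
    Baseline,      # baseline segmentation model
    FedKits19,     # federated dataset wrapper
    NUM_CLIENTS,
    metric,
)
from flamby.datasets.fed_kits19 import evaluate_dice_on_tests  # location may differ

model = Baseline()
# ... train the model or load a checkpoint here ...

test_dataloaders = [
    torch.utils.data.DataLoader(
        FedKits19(center=i, train=False),  # assumed constructor arguments
        batch_size=2,                      # batch size used in the benchmark
        shuffle=False,
    )
    for i in range(NUM_CLIENTS)
]

# Sample-level Dice with batched inference, matching the numbers in the article.
print(evaluate_dice_on_tests(model, test_dataloaders, metric))
```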
Thanks @jeandut, that helps!

I think most users of the repo would attempt using `evaluate_model_on_tests`. Adding a note or some documentation regarding which function to use per dataset would be helpful. As another option, fixing `evaluate_model_on_tests` also seems easier.
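For instance (just a sketch; `with_argmax` is a made-up wrapper, not something in FLamby), the generic loop could stay untouched if the softmax/argmax step were folded into the metric:

```python
# Hypothetical wrapper, not part of FLamby: folds the softmax + argmax step into
# the metric so that evaluate_model_on_tests would agree with evaluate_dice_on_tests.
import torch

def with_argmax(metric):
    def wrapped(y_pred, y_true):
        # Adjust the argument order to whatever evaluate_model_on_tests actually passes.
        preds = torch.softmax(torch.as_tensor(y_pred), dim=1).argmax(1)
        return metric(preds, torch.as_tensor(y_true))
    return wrapped

# e.g. evaluate_model_on_tests(model, test_dataloaders, with_argmax(metric))
# (exact signature to be checked against flamby.utils)
```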
You are completely right about the lack of documentation on loss functions; I will open an issue about it.
However, the goal of FLamby is not to impose metrics or anything on the user; it is meant to be a playground for FL research.