owkin/FLamby

Mismatching evaluation code for FedKiTS19

akash-07 opened this issue · 4 comments

Dear authors,

The evaluate_dice_on_tests function calls

 y_pred = model(X).detach().cpu()
 preds_softmax = softmax_helper(y_pred)
 preds = preds_softmax.argmax(1)
 y = y.detach().cpu()
 dice_score = metric(preds, y)

and uses preds in the metric function.

However, the general-purpose evaluate_model_on_tests available in flamby.utils passes the raw y_pred to the metric. This mismatch causes different metric values for Fed_KiTS19 evaluation depending on which function is used.
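For concreteness, here is a minimal self-contained sketch of the two call patterns; the model, shapes, and tensors below are illustrative assumptions, not the actual Fed-KiTS19 pipeline:

 import torch

 # Stand-in 3D segmentation model and data, just to show the tensor shapes involved.
 model = torch.nn.Conv3d(in_channels=1, out_channels=3, kernel_size=1)
 X = torch.randn(2, 1, 8, 8, 8)                 # fake batch of volumes
 y = torch.randint(0, 3, (2, 8, 8, 8))          # fake integer ground-truth masks

 y_pred = model(X).detach().cpu()               # raw logits, shape (2, 3, 8, 8, 8)

 # evaluate_dice_on_tests-style path: the metric receives hard labels.
 preds = torch.softmax(y_pred, dim=1).argmax(1)
 print(preds.shape, preds.dtype)                # torch.Size([2, 8, 8, 8]) torch.int64

 # evaluate_model_on_tests-style path: the metric receives the raw logits instead.
 print(y_pred.shape, y_pred.dtype)              # torch.Size([2, 3, 8, 8, 8]) torch.float32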

It seems like evaluate_dice_on_tests is the correct version. Can you please confirm?

Thanks!

Hello @akash-07!
For models working on data modalities that are too big to fit in RAM, we provide functions that batch the inference, such as evaluate_dice_on_tests, in order to measure the prediction/ground-truth match at the sample level; this is also the case for Fed-LIDC. These are the functions used in the benchmark script.
I agree that it's not really clear. The metric functions also "work", but they operate patch-wise; the sketch below gives the rough idea of the sample-level evaluation.
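The following is a rough, self-contained sketch of a batched, per-sample dice evaluation; the helper name, class handling, and shapes are assumptions, and this is not the actual FLamby implementation:

 import torch

 def samplewise_dice_eval(model, test_dataloader, num_classes=3, eps=1e-8):
     # Run inference batch by batch and compute one dice score per sample,
     # skipping the background class, then average over the whole test set.
     model.eval()
     per_sample_scores = []
     with torch.no_grad():
         for X, y in test_dataloader:
             preds = torch.softmax(model(X), dim=1).argmax(1).cpu()
             y = y.cpu()
             for p, t in zip(preds, y):
                 class_scores = []
                 for c in range(1, num_classes):
                     inter = ((p == c) & (t == c)).sum().float()
                     denom = (p == c).sum().float() + (t == c).sum().float()
                     class_scores.append(((2 * inter + eps) / (denom + eps)).item())
                 per_sample_scores.append(sum(class_scores) / len(class_scores))
     return sum(per_sample_scores) / max(len(per_sample_scores), 1)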
Maybe @ErumMushtaq can provide more info?

So, long story short, evaluate_dice_on_tests is the "true" function to use to replicate the benchmark numbers from the article; see https://github.com/owkin/FLamby/blob/main/flamby/benchmarks/benchmark_utils.py#:~:text=elif%20dataset_name%20%3D%3D%20%22fed_kits19,compute_ensemble_perf%20%3D%20False (lines 589 to 610), with a batch size of 2.

Thanks @jeandut, that helps!

I think most users of the repo would first reach for evaluate_model_on_tests. Adding a note or some documentation about which evaluation function to use for each dataset would be helpful.

Alternatively, fixing evaluate_model_on_tests itself seems even easier; something along the lines of the sketch below could work.
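As a rough illustration of that kind of fix (the names, signature, and post-processing hook below are hypothetical, not the actual flamby.utils API), the generic loop could accept an optional prediction post-processing step:

 import torch

 def evaluate_model_on_tests(model, test_dataloaders, metric, postprocess=None):
     # Simplified stand-in for a generic evaluation loop: the optional
     # `postprocess` hook lets datasets such as Fed-KiTS19 turn raw logits
     # into hard labels before the metric is applied.
     results = {}
     model.eval()
     with torch.no_grad():
         for i, dataloader in enumerate(test_dataloaders):
             preds_list, targets_list = [], []
             for X, y in dataloader:
                 y_pred = model(X).detach().cpu()
                 if postprocess is not None:
                     y_pred = postprocess(y_pred)      # e.g. softmax + argmax
                 preds_list.append(y_pred)
                 targets_list.append(y.detach().cpu())
             results[f"client_test_{i}"] = metric(
                 torch.cat(preds_list), torch.cat(targets_list)
             )
     return results

 # For Fed-KiTS19 one could then pass something like:
 # postprocess=lambda logits: torch.softmax(logits, dim=1).argmax(1)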

You are completely right about the lack of documentation on loss functions; I will open an issue about it.
However, the goal of FLamby is not to impose metrics or anything else on the user; it is meant to be a playground for FL research.