fartashf/vsepp

Metrics for 1k test images on MS COCO

chinmay5 opened this issue · 1 comments

I am sorry to create a new issue, but I think it is better this way since the doubt might be shared by a few other people.

Just to confirm: do we evaluate on the 5 folds and report the best result from one of these samples (the best-performing 1k sample), or do we report the average value over all 5 sets? If the result is for the best-performing 1k sample, is that not a kind of "pick and choose" scenario?

Honestly, this is a wonderful code base and I think this is the most likely place to find a solution to my doubts :)

The metrics reported for the 1K results are averaged over 5 splits. Here is the line that does it:

mean_metrics = tuple(np.array(results).mean(axis=0).flatten())
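To illustrate what that line computes, here is a minimal self-contained sketch. The per-split numbers below are made up for demonstration, and the metric names (R@1, R@5, R@10) are just one possible layout; the point is that `mean(axis=0)` averages each metric across the 5 splits rather than selecting the best split:

```python
import numpy as np

# Hypothetical metrics for each of the 5 COCO 1k test splits,
# e.g. (R@1, R@5, R@10). Values are illustrative only.
results = [
    (50.1, 79.2, 88.0),
    (49.7, 78.8, 87.5),
    (50.5, 79.6, 88.3),
    (49.9, 79.0, 87.9),
    (50.3, 79.4, 88.1),
]

# Average each metric over the 5 splits (axis=0), matching the quoted line.
mean_metrics = tuple(np.array(results).mean(axis=0).flatten())
print(mean_metrics)  # one averaged value per metric, e.g. R@1 ≈ 50.1
```

So the reported 1K numbers are the per-metric means over the folds, not the numbers from the single best-performing fold.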