Question about Pixel-F1 metric
When evaluating the CASIA benchmark, I've encountered an issue that seems a bit unusual. In this project, the labeling convention is as follows:
Real/untampered images are labeled as negative or 0.
Fake/tampered images are labeled as positive or 1.
This labeling scheme is consistent with the pixel-level masks, where pixels with a value of 1 indicate tampered regions. This can be seen in the dataset implementation at:
IMDLBenCo/IMDLBenCo/datasets/abstract_dataset.py, lines 74 to 82 (commit 1807684)
When I read the implementation of the F1 score, a question occurred to me: what is the Pixel-F1 score for an authentic (untampered) image in this implementation?
IMDLBenCo/IMDLBenCo/evaluation/F1.py, lines 122 to 124 (commit 1807684)
If we have an authentic image where all pixels are labeled as negative (0), and the model correctly predicts all pixels as negative, the F1 score will indeed be 0, since TP = 0, FP = 0, and FN = 0 (the resulting 0/0 is conventionally treated as 0).
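For illustration, here is a minimal sketch of this degenerate case (not the repository's evaluator; the epsilon handling is my assumption):

```python
# Minimal sketch of pixel-level F1 on an authentic image; the epsilon in the
# denominator mirrors a common way of avoiding division by zero.
import numpy as np

def pixel_f1(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    return 2 * tp / (2 * tp + fp + fn + eps)

gt = np.zeros((256, 256), dtype=np.uint8)   # authentic image: all-zero mask
pred = np.zeros_like(gt)                    # perfect prediction
print(pixel_f1(pred, gt))                   # 0.0, because TP = FP = FN = 0
```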
I conducted a small test to verify this behavior, and the results confirmed my suspicion. In my experiment, I loaded a dataset consisting of 50% authentic (untampered) images and 50% tampered images into the dataloader. Then, instead of using a model for prediction, I simply used the ground truth masks as the predictions.
When I calculated the overall Pixel-F1 score for this setup, I obtained a value of 0.5.
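A self-contained reproduction of this toy test (using scikit-learn's f1_score rather than the framework's evaluator; the mask shapes and the forged patch are arbitrary choices):

```python
# 50 authentic (all-zero) masks and 50 tampered masks; the ground truth itself
# is used as the "prediction", yet the averaged pixel-level F1 is only 0.5
# because every authentic image scores 0.
import numpy as np
from sklearn.metrics import f1_score

masks = [np.zeros((64, 64), dtype=np.uint8) for _ in range(50)]  # authentic
for _ in range(50):                                              # tampered
    m = np.zeros((64, 64), dtype=np.uint8)
    m[:16, :16] = 1                                              # forged patch
    masks.append(m)

scores = [f1_score(m.ravel(), m.ravel(), zero_division=0) for m in masks]
print(np.mean(scores))  # 0.5
```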
I know that TruFor may employ a foreground/background inversion operation to address this problem:
https://github.com/grip-unina/TruFor/blob/f1f1b7410e6332b8123866bd6c149991cb641be1/test_docker/metrics.py#L22-L26
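For reference, one common form of such a foreground/background inversion (a "permuted" F1) scores both the predicted mask and its inversion and keeps the better value; the sketch below is a generic illustration, not the TruFor code:

```python
# Generic permuted pixel-level F1: evaluate the binary prediction and its
# foreground/background inversion, then keep the higher score. Taking the
# maximum can only raise the result relative to the standard F1.
import numpy as np
from sklearn.metrics import f1_score

def permuted_pixel_f1(pred: np.ndarray, gt: np.ndarray) -> float:
    direct = f1_score(gt.ravel(), pred.ravel(), zero_division=0)
    inverted = f1_score(gt.ravel(), 1 - pred.ravel(), zero_division=0)
    return max(direct, inverted)
```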
I'm a bit confused. Does this framework not allow for testing images where all pixels are negative? Or how can I correctly reproduce the testing protocol used in the paper?
If the foreground and background inversion operation is not performed, will it affect the method's performance?
Thanks!
Hi,
We noticed this issue early in the project. As stated in the corresponding issue of IML-ViT, calculating a pixel-level F1 score for an authentic image is meaningless. The pixel-level evaluator does not follow the TruFor-style permuted F1, since that would make the result higher than the standard F1 score, as also stated in our paper.
Thus, the pixel-level experiments in our paper were conducted by including only manipulated images during pixel-level F1 testing.
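For concreteness, a minimal sketch of that protocol (placeholder names and scikit-learn, not IMDLBenCo's evaluator): average the pixel-level F1 only over images whose ground-truth mask contains at least one tampered pixel.

```python
# Average pixel-level F1 over manipulated images only; authentic images
# (all-zero masks) are skipped instead of contributing a score of 0.
import numpy as np
from sklearn.metrics import f1_score

def pixel_f1_manipulated_only(predictions, gt_masks):
    scores = []
    for pred, gt in zip(predictions, gt_masks):
        if gt.max() == 0:      # authentic image: excluded from pixel-level F1
            continue
        scores.append(f1_score(gt.ravel(), pred.ravel(), zero_division=0))
    return float(np.mean(scores)) if scores else 0.0
```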
Thanks! It would be better to provide the JSON files for these benchmarks, or to add some guidance in the documentation.
We apologize for the delay as we are currently busy revising the paper for the camera-ready version. We will shift our focus to this matter once the revisions are complete. Your understanding is greatly appreciated.
In fact, due to the difficulty of reducing metric variables across multiple GPUs, we have considered several solutions for computing image-level and pixel-level metrics simultaneously. However, it is impossible to avoid a discrepancy in the number of real images across different GPUs, which leads to inaccuracies when merging the results and computing the metrics.
Therefore, the simplest solution at the moment seems to be an alert check: when pixel-level metrics are being computed and a mismatch in real images is detected, the system would issue a warning (see the sketch below).
Alternatively, we could use single-GPU inference when calculating both types of metrics.
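A hedged sketch of the alert check (illustrative names, using torch.distributed; not existing IMDLBenCo code): gather the per-rank count of real images and warn when the ranks disagree.

```python
# Before merging pixel-level metrics across GPUs, gather each rank's count of
# real (authentic) images and emit a warning if the counts differ, since an
# uneven split can bias the merged metric.
import warnings
import torch.distributed as dist

def warn_on_uneven_real_image_count(num_real_on_this_rank: int) -> None:
    if not (dist.is_available() and dist.is_initialized()):
        return
    counts = [None] * dist.get_world_size()
    dist.all_gather_object(counts, num_real_on_this_rank)
    if len(set(counts)) > 1:
        warnings.warn(
            f"Real-image counts differ across ranks: {counts}; "
            "merged pixel-level metrics may be inaccurate."
        )
```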
Given your excellent coding skills and insightful suggestions, I would love to discuss this issue with you for further collaboration if you have the time. You can reach me at xiaochen.ma.cs@gmail.com.