Some questions about the experimental results and threshold selection
Hello, thanks for sharing your excellent work!
I have some questions about the experimental results and threshold selection.
On the SWaT dataset, I conducted experiments using the two threshold selection methods you provided. When I use the maximum error of the validation set as the threshold, F1 is 0.50. When I use the second method to search for the optimal threshold on the test set, F1 is 0.80, which matches the results reported in the paper. Based on this, I have two questions (a sketch of both methods follows my questions below):
- Which threshold selection method corresponds to the results reported in your paper? In your experiments, do the results of the two threshold selection methods differ greatly?
- As for the second threshold selection method, I understand that it selects the threshold that yields the highest F1 under the assumption that the test-set anomaly labels are known. But I have a question: the test-set labels are invisible in real scenarios, so is this reasonable? I see that other recent works also adopt the optimal-threshold method, so should we focus on the optimal F1 that can be achieved in theory?
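For reference, here is a minimal sketch of the two methods as I understand them (all function names, the candidate grid, and the synthetic scores are my own illustration, not from this repo):

```python
import numpy as np
from sklearn.metrics import f1_score

def max_val_threshold(val_errors):
    """Method 1: use the maximum anomaly score observed on the (normal) validation set."""
    return np.max(val_errors)

def best_f1_threshold(test_errors, test_labels, n_candidates=200):
    """Method 2: sweep candidate thresholds and keep the one maximizing F1.

    Note this uses the test labels, which is exactly the concern in question two.
    """
    candidates = np.linspace(test_errors.min(), test_errors.max(), n_candidates)
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        preds = (test_errors > t).astype(int)
        f1 = f1_score(test_labels, preds, zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Example with synthetic scores (placeholders for real model errors):
rng = np.random.default_rng(0)
val_errors = rng.normal(0.0, 1.0, 1000)
test_errors = np.concatenate([rng.normal(0.0, 1.0, 900), rng.normal(4.0, 1.0, 100)])
test_labels = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])

t1 = max_val_threshold(val_errors)
print("validation-max threshold:", t1, "F1:", f1_score(test_labels, (test_errors > t1).astype(int)))

t2, f1_2 = best_f1_threshold(test_errors, test_labels)
print("best-F1 threshold:", t2, "F1:", f1_2)
```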
I look forward to receiving your reply. Thank you very much!
I have the same issue with the results. I can't come close to the results of the paper with the validation threshold, which is why I'm also curious about question one.
Thanks for your interest in our work.
- The reported results are based on the validation-set threshold, so they may vary with the random seed. For some seeds, the results of the validation-based and best-F1 thresholds are very close, but for other seeds there can be some variation.
- Yes, some works use best F1 as an evaluation metric. Since F1 scores require threshold selection, I think a better evaluation could also include threshold-agnostic metrics, e.g., AUROC, together with F1-related metrics (see the sketch below).
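For instance, a minimal sketch of what I mean (the function and variable names are just placeholders):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve

def evaluate(test_scores, test_labels):
    """Report a threshold-agnostic metric (AUROC) alongside the best F1
    over all thresholds implied by the precision-recall curve."""
    auroc = roc_auc_score(test_labels, test_scores)
    precision, recall, _ = precision_recall_curve(test_labels, test_scores)
    # Best achievable F1 over all thresholds (the "theoretical optimum"):
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return auroc, f1.max()
```

AUROC evaluates the ranking of anomaly scores over all thresholds, so it does not depend on how a single threshold is chosen, while the best-F1 number stays comparable to prior work.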