Some questions about the experimental results and threshold selection
Hello, thanks for sharing your excellent work!
I have some questions about the experimental results and threshold selection.
On the SWaT dataset, I conducted experiments using the two threshold selection methods you provided. When I use the maximum error of the validation set as the threshold, F1 is 0.50. When I use the second method to search for the optimal threshold on the test set, F1 is 0.80, which matches the results reported in the paper. Based on this, I have two questions (a sketch of both methods follows my questions below):
- Which threshold selection method corresponds to the results reported in your paper? In your experiments, do the results of the two threshold selection methods differ greatly?
- As for the second threshold selection method, I understand that it selects the threshold that yields the highest F1 under the assumption that the test-set anomaly labels are known. But I have a question: the test-set labels are invisible in real scenarios, so is this reasonable? I see that other recent works also adopt the optimal-threshold method, so should we focus on the optimal F1 that can be achieved in theory?
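For reference, here is a minimal sketch of the two methods as I understand them (all function names, the candidate grid, and the synthetic scores are my own illustration, not from this repo):

```python
import numpy as np
from sklearn.metrics import f1_score

def max_val_threshold(val_errors):
    """Method 1: use the maximum anomaly score observed on the (normal) validation set."""
    return np.max(val_errors)

def best_f1_threshold(test_errors, test_labels, n_candidates=200):
    """Method 2: sweep candidate thresholds and keep the one maximizing F1.

    Note this uses the test labels, which is exactly the concern in question two.
    """
    candidates = np.linspace(test_errors.min(), test_errors.max(), n_candidates)
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        preds = (test_errors > t).astype(int)
        f1 = f1_score(test_labels, preds, zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Example with synthetic scores (placeholders for real model errors):
rng = np.random.default_rng(0)
val_errors = rng.normal(0.0, 1.0, 1000)
test_errors = np.concatenate([rng.normal(0.0, 1.0, 900), rng.normal(4.0, 1.0, 100)])
test_labels = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])

t1 = max_val_threshold(val_errors)
print("validation-max threshold:", t1, "F1:", f1_score(test_labels, (test_errors > t1).astype(int)))

t2, f1_2 = best_f1_threshold(test_errors, test_labels)
print("best-F1 threshold:", t2, "F1:", f1_2)
```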
I look forward to receiving your reply. Thank you very much!
I have the same issue with the results. I can't come close to the results of the paper with the validation threshold, which is why I'm also curious about question one.
Thanks for your interest in our work.
- The reported results are based on the validation-set threshold, so they may vary with the random seed. For some seeds, the results of the validation-based and best-F1 thresholds are very close, but for other seeds there can be some variation.
- Yes, some works use best F1 as an evaluation metric. Since F1 scores require threshold selection, I think a better evaluation could also include threshold-agnostic metrics, e.g., AUROC, together with F1-related metrics (see the sketch below).
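For instance, a minimal sketch of what I mean (the function and variable names are just placeholders):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve

def evaluate(test_scores, test_labels):
    """Report a threshold-agnostic metric (AUROC) alongside the best F1
    over all thresholds implied by the precision-recall curve."""
    auroc = roc_auc_score(test_labels, test_scores)
    precision, recall, _ = precision_recall_curve(test_labels, test_scores)
    # Best achievable F1 over all thresholds (the "theoretical optimum"):
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return auroc, f1.max()
```

AUROC evaluates the ranking of anomaly scores over all thresholds, so it does not depend on how a single threshold is chosen, while the best-F1 number stays comparable to prior work.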