About "detection adjustment" in lines 339-360 of solver.py
wuhaixu2016 opened this issue · 9 comments
Since some researchers are confused about the "detection adjustment", we provide some clarification here.
(1) Why use "detection adjustment"?
First, I strongly suggest reading the original paper (Xu et al., 2018), which gives a comprehensive explanation of this operation.
In our paper, we follow this convention for the following reasons:
- Fair comparison: As stated in the Implementation details section of our paper, the adjustment is a widely used convention in time series anomaly detection. In particular, on the benchmarks used in our paper, the previous methods all apply the adjustment during evaluation (Shen et al., 2020). Thus, we also adopt the adjustment for model evaluation.
- Real-world meaning: One abnormal event causes a whole segment of abnormal time points, so the adjustment corresponds to the "abnormal event detection" task, which evaluates how well the model detects abnormal events in the full record. This is a meaningful task for real-world applications: once an abnormal event has been detected, we can send a worker to check that time segment for security.
In summary, you can view the adjustment as an "evaluation protocol" that measures the capability of models in "abnormal event detection".
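For readers who have not seen the operation written out, the following is a minimal sketch of the adjustment as described above: if any point inside a contiguous ground-truth anomaly segment is predicted as anomalous, the whole segment is counted as detected. It is an illustration under that reading, not the code in solver.py itself; the function name `point_adjust` and the toy arrays are assumptions made for this example.

```python
import numpy as np

def point_adjust(pred, gt):
    """Return a copy of pred in which every ground-truth anomaly segment
    that contains at least one predicted anomaly is marked fully detected."""
    pred = np.asarray(pred).astype(int).copy()
    gt = np.asarray(gt).astype(int)
    n = len(gt)
    i = 0
    while i < n:
        if gt[i] == 1:
            # Find the exclusive end of this contiguous ground-truth segment.
            j = i
            while j < n and gt[j] == 1:
                j += 1
            # A single hit anywhere in the segment credits the whole segment.
            if pred[i:j].any():
                pred[i:j] = 1
            i = j
        else:
            i += 1
    return pred

gt   = [0, 1, 1, 1, 0, 0, 1, 1, 0]
pred = [0, 0, 1, 0, 0, 1, 0, 0, 0]
print(point_adjust(pred, gt))  # [0 1 1 1 0 1 0 0 0]
```

Note that the ground-truth labels are an input to the function, that false positives outside a segment are left untouched, and that a segment with no hit stays undetected.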
(2) We have provided a comprehensive and fair comparison in our paper.
- All the baselines compared in our paper are also evaluated with this "adjustment". Note that this evaluation is widely used in previous papers for experiments on SMD, SWaT, and so on. Thus, the comparison is fair.
- For a comprehensive analysis, we also provide a benchmark on the UCR dataset in Appendix L, which is from KDD Cup. Most anomalies in this dataset are recorded at a single time point only. Thus, if you want a comparison on single-time-point anomaly detection, this dataset can provide some intuition.
If you still have questions about the adjustment, feel free to email me to discuss further (whx20@mails.tsinghua.edu.cn).
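Hello, I have a question. If I want to apply this work to another domain, how do I export the values detected as anomalies and then filter them out of the original data?
Does anyone know how to solve this problem when the test labels are unknown?
Hello, this adjustment is only used for computing the metric. If you want to use the model for deployment, you can simply comment it out. @xiaobiao998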
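As a rough illustration of the label-free deployment path mentioned in the reply above, the sketch below thresholds per-time-step anomaly scores at a percentile and uses the resulting mask to export anomaly indices and filter the raw data. The helper `flag_anomalies`, the `anomaly_ratio` parameter as used here, and the synthetic `raw` and `scores` arrays are assumptions for this example; the repository's own thresholding may differ.

```python
import numpy as np

def flag_anomalies(scores, anomaly_ratio=1.0):
    """Flag roughly the top `anomaly_ratio` percent of scores as anomalous.
    No ground-truth labels are needed."""
    scores = np.asarray(scores, dtype=float)
    threshold = np.percentile(scores, 100.0 - anomaly_ratio)
    return scores > threshold, threshold

rng = np.random.default_rng(0)
raw = rng.normal(size=(1000, 3))   # hypothetical raw series: 1000 steps, 3 channels
scores = rng.random(1000)          # stand-in for the model's per-step anomaly scores
scores[[100, 500]] = 5.0           # two injected high-score steps

is_anomaly, threshold = flag_anomalies(scores, anomaly_ratio=1.0)
anomaly_idx = np.flatnonzero(is_anomaly)   # time indices to export or inspect
cleaned = raw[~is_anomaly]                 # raw data with flagged steps removed
print(anomaly_idx, cleaned.shape)
```

The exported indices can then be written to a file or joined back to timestamps in whatever format the downstream system expects.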
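The performance of this code depends entirely on the adjustment made after looking at the GT. In reality there is no GT, which corresponds to the results you get with the adjustment commented out. If I were a reviewer, I would strongly demand that this adjustment be removed; it is an unrealistic step that exists purely to make the numbers look good. This cannot be called a fair comparison, because the adjustment may favor the authors' proposed method, and since the adjustment does not exist in reality, it is meaningless.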
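Please read the explanation above carefully. For clarity, we summarize it again below.
(1) **Fair comparison:** Since Xu et al. (2018), [most works] have followed this adjustment, so we use this technique as well.
(2) **Practical meaning:** Consider the following scenario: in actual deployment, our model locates an anomalous point, and a worker can be sent to check the period before and after it. No GT is needed at all, so real-world deployment is still possible. With the adjustment, the metric can therefore be understood as an "anomaly-event-based" metric.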
Hi @wuhaixu2016,
Thank you for sharing your code and making it easy to use and reproduce the results.
I’d like to clarify my understanding of the code: From a theoretical perspective, the following lines introduce information leakage from the training set into the test set. Specifically, the model's predictions (`pred`) are being directly adjusted based on the ground truth labels (`gt`). This means that information from the actual labels is being used to modify the predicted outcomes, resulting in evaluation metrics that may be overly optimistic. Adjusting `pred` using the ground truth during evaluation allows information that would not be available in a real deployment scenario to influence the results.
Practically speaking, since we wouldn’t have access to ground truth labels during deployment, I believe these lines should be omitted. After removing them, here are the results I obtained:
| Dataset | Setting | Accuracy | Precision | Recall | F-score |
|---|---|---|---|---|---|
| SMD | With data leakage | 0.9920 | 0.8894 | 0.9235 | 0.9061 |
| SMD | After omitting the lines | 0.9543 | 0.1047 | 0.0133 | 0.0236 |
| SMAP | With data leakage | 0.9901 | 0.9361 | 0.9900 | 0.9623 |
| SMAP | After omitting the lines | 0.8648 | 0.1275 | 0.0097 | 0.0180 |
| PSM | With data leakage | 0.9854 | 0.9729 | 0.9745 | 0.9737 |
| PSM | After omitting the lines | 0.7181 | 0.2854 | 0.0110 | 0.0212 |
If my understanding is incorrect, could you please clarify? Alternatively, if you have any suggestions for addressing this issue in a practical way, I would greatly appreciate your input.
Thank you!
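Hello. After reading the paper carefully, I think the idea is very good and it matches my own thinking. However, when I tried commenting out the adjustment, the predicted scores had almost no predictive power for the labels on the SMD dataset. I wonder whether the performance gain comes from the adjustment greatly boosting results that are essentially random. To check this, I trained the model once on a single sample, with input of shape (1, 100, 38), and the performance after adjustment was still very high (F-score: 0.8765). I am not sure whether this check is sound, but I would like a way to show that the predictions are correlated with the target variable. Perhaps I could try using the intermediate output embeddings for a downstream classification task? I also tried to interpret this operation using the author's reasoning: in the real world, without true labels, if workers inspect the points before and after each prediction, and the predictions themselves have no discriminative power, isn't that equivalent to handing workers random points to inspect? The final result may look good simply because a predicted point happens to fall inside an anomalous interval, and anomalous intervals are highly contiguous. My confusion is that the idea really is good, and I hope there is a way to make the results correlated with the target variable, that is, to give them predictive power. What is your view? Is there a way to show that the predictions are related to the target, and how could this be used in a real business scenario? (Outputting the result directly as a score clearly does not work.)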
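One way to probe the question raised in this comment is a random-detector sanity check: score a detector that ignores the data entirely, once with and once without the adjustment. The sketch below does this on synthetic labels; the segment layout, the 1% positive rate, and the helper names `point_adjust` and `f1_score` are all assumptions chosen for illustration, and the numbers it prints say nothing about any particular model or dataset.

```python
import numpy as np

def point_adjust(pred, gt):
    """Mark a whole ground-truth segment as detected if any point in it is hit."""
    pred = pred.copy()
    # Pad with zeros so every segment has both a start and an end edge.
    edges = np.diff(np.concatenate(([0], gt, [0])))
    starts, ends = np.flatnonzero(edges == 1), np.flatnonzero(edges == -1)
    for s, e in zip(starts, ends):
        if pred[s:e].any():
            pred[s:e] = 1
    return pred

def f1_score(pred, gt):
    """Point-wise F1, returning 0.0 when there are no positives at all."""
    tp = int(np.sum((pred == 1) & (gt == 1)))
    fp = int(np.sum((pred == 1) & (gt == 0)))
    fn = int(np.sum((pred == 0) & (gt == 1)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

rng = np.random.default_rng(0)
gt = np.zeros(10_000, dtype=int)
for start in range(1_000, 10_000, 2_000):   # five anomaly segments of 200 points each
    gt[start:start + 200] = 1

pred = (rng.random(gt.size) < 0.01).astype(int)   # random detector, about 1% positives

f1_raw = f1_score(pred, gt)
f1_adj = f1_score(point_adjust(pred, gt), gt)
print(f"raw F1 = {f1_raw:.4f}, adjusted F1 = {f1_adj:.4f}")
```

Because the adjustment only converts missed points inside hit segments into true positives, the adjusted F1 can never be lower than the raw F1. Comparing a model against such a random baseline under the same protocol is one way to tell how much of a score comes from the model and how much from the adjustment.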
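The original statement of the adjustment is as follows:
Therefore, as long as at least one point in an anomalous interval is flagged as anomalous, the interval is regarded as "correctly detected".
So I suspect that this metric only suits tasks in which anomalies are highly contiguous and does not suit point-anomaly tasks. The metric was designed for interval anomaly detection.
Is this understanding mistaken?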
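The reading in this comment can be checked on toy data by counting how many missed points the adjustment would flip to "detected". In the sketch below, the helper name `adjustment_gain` and the arrays are hypothetical and serve only to illustrate the rule quoted above.

```python
import numpy as np

def adjustment_gain(pred, gt):
    """Count the missed points that point adjustment would flip to detected."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    # Pad with zeros so every ground-truth segment has a start and an end edge.
    edges = np.diff(np.concatenate(([0], gt, [0])))
    starts, ends = np.flatnonzero(edges == 1), np.flatnonzero(edges == -1)
    return sum(int((e - s) - pred[s:e].sum())
               for s, e in zip(starts, ends) if pred[s:e].any())

# Interval anomaly: one hit in a 5-point segment credits 4 extra points.
print(adjustment_gain([0, 0, 1, 0, 0, 0, 0], [0, 1, 1, 1, 1, 1, 0]))  # 4
# Point anomalies: every segment has length 1, so nothing can be flipped.
print(adjustment_gain([0, 1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 1, 0, 0]))  # 0
```

When every anomaly is a single point, the adjustment has nothing to fill in and the adjusted metric coincides with the plain point-wise metric, which is consistent with the interpretation that the protocol targets interval anomalies.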