d-ailin/GDN

Reproduction of results on the WADI dataset

Closed this issue · 17 comments

Hi,
thanks for your work and the code release! I am currently trying to reproduce the results to better understand the whole concept and your implementation. The problem is that the results I get on the WADI dataset are really bad and nowhere near yours. My issue is similar to #47.

My run.sh for WADI looks like this (similar to #17):
[screenshot of the run.sh configuration]

And the resulting values are:
[screenshot of the resulting metrics]

I already compared my own WADI CSV files to the demo data you provided, and they match. Furthermore, I calculated the number of anomalies in my CSV dataset and compared it to the expected value given in the PDF file that accompanies the iTrust data, and they match as well.
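
For reference, a minimal sketch of such a label count (the file path and the `attack` label column are placeholders and may differ from your own preprocessing output):

```python
import pandas as pd

# Count labeled anomaly rows in the processed test split.
# Path and label column name are placeholders -- adjust to your own files.
test = pd.read_csv("data/wadi/test.csv")

n_anomalies = int(test["attack"].sum())
print(f"{n_anomalies} anomalous rows out of {len(test)} "
      f"({100 * n_anomalies / len(test):.2f}%)")
```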

I changed the random seed multiple times, but it only improved the result marginally.

The results for MSL seem to be okay:
[screenshot of the MSL results]

Did you ever encounter this problem?
Do you have any advice on how I could solve this?

And would you be interested in a Dockerfile for the dependencies of this repository?

Hi, thanks for your interest in our work.
Just to check: we used the A1 version of WADI, as there are two versions of WADI. The provided description document (PDF) may contain typos, e.g., the 9th attack's start date should be 11/10/17, which can lead to the anomalies for that case being mislabeled. Since you have presumably followed the preprocessing instructions under the scripts directory, could you check whether your labeling is reasonable? A gap this large is usually caused by the data pipeline. Thanks!
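
To make the labeling step concrete, here is a rough sketch of how the attack windows from the description PDF could be applied (the file name, the `Timestamp` column, and the day-first date format are assumptions; the actual preprocessing is under the scripts directory):

```python
import pandas as pd

# Rough sketch only -- not the repository's preprocessing code.
df = pd.read_csv("WADI_attackdata.csv", parse_dates=["Timestamp"], dayfirst=True)

# Fill in (start, end) string pairs from the iTrust description PDF,
# using 11/10/17 as the (corrected) start date of the 9th attack.
attack_windows = []

df["attack"] = 0
for start, end in attack_windows:
    in_window = (df["Timestamp"] >= pd.Timestamp(start)) & (df["Timestamp"] <= pd.Timestamp(end))
    df.loc[in_window, "attack"] = 1
```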

Thanks a lot for the fast reply! I am using the A1 version of WADI, realized the PDF must contain a typo, and used the corrected start date for the 9th attack. Therefore, the labeling should be reasonable. Do you perhaps have a pretrained model lying around that I could use for testing?
Furthermore, I counted the number of attack rows in WADI_attackdata_labelled.csv and got 9948; the loaded dataframe has shape (172801, 130). After downsampling with your preprocessing script, I have 17280 test rows with a total of 1006 attack rows (= 5.82%).
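
A quick way to sanity-check those numbers after a factor-10 downsample (the file path, the label column name, and the "window is anomalous if any row in it is" rule are assumptions, not necessarily what the preprocessing script does):

```python
import pandas as pd

# Placeholder path -- adjust to the labelled WADI csv.
df = pd.read_csv("WADI_attackdata_labelled.csv")

factor = 10
n_windows = len(df) // factor           # 172801 rows -> 17280 windows

labels = df["attack"].values[: n_windows * factor].reshape(n_windows, factor)
window_labels = labels.max(axis=1)      # assumed rule: any anomalous row marks the window

print(n_windows, int(window_labels.sum()))  # expect roughly 17280 and ~1006
```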

Also, are the numbers I reported for the MSL dataset as expected? Maybe my Docker image isn't set up properly?

I tried debugging my data and code further and found the problem! As @d-ailin already suspected, the problem lay with the data. My German Excel version didn't like the CSV file, since Germany uses the decimal comma instead of the decimal point. That silently changed the data and led to the bad results.
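
In case someone runs into the same thing, a quick sanity check that catches this kind of locale damage (the path and the non-sensor column names are placeholders):

```python
import pandas as pd

# Sensor columns that load as 'object' instead of a numeric dtype are a typical
# sign of locale damage (decimal commas, shifted delimiters, re-saved files).
df = pd.read_csv("WADI_attackdata_labelled.csv")

suspect = [c for c in df.columns
           if df[c].dtype == object and c not in ("Date", "Time")]
print("Non-numeric columns:", suspect)

# If a file genuinely uses decimal commas, pandas can parse it directly:
# pd.read_csv("file.csv", sep=";", decimal=",")
```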

I was able to run the script and get plausible results:
[screenshot of the results]

Hi,

I have an additional question. The highest F1 score I get is around .54, with a precision of up to .8. In your paper you report a precision of .97 with an F1 score of .57. Did most of your runs perform at that level, or was that a particularly good run?
Sorry for all these questions and thanks for your patience!

Hi,
I have the same problem as in your original question: very low precision and very high recall. Could you share more details about how you solved it?
[screenshot of the results]

Hi @peerschuett, that's totally fine. As mentioned in the paper, we select the threshold based on the validation set (which is randomly split from the normal data) to compute the F1 score, so the results may vary slightly when different validation sets are produced by different seeds. The reported results were obtained under one particular validation setting. Just to check: I previously uploaded the feature list used for WADI together with the processed data to the Google Drive link. If you haven't used it yet, you can simply replace the feature list file with it, without needing to change the data file.
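
For intuition, an illustrative sketch of the "threshold from the validation set" idea (this is not the repository's evaluation code; taking the maximum validation score as the threshold is just one possible choice here):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def f1_with_val_threshold(val_scores: np.ndarray,
                          test_scores: np.ndarray,
                          test_labels: np.ndarray):
    """Pick a threshold from anomaly scores on the normal-only validation
    split, then evaluate precision/recall/F1 on the test set."""
    threshold = val_scores.max()
    preds = (test_scores > threshold).astype(int)
    p, r, f1, _ = precision_recall_fscore_support(
        test_labels, preds, average="binary", zero_division=0)
    return p, r, f1
```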

Also, feel free to let me know if you still need the pretrained model; you can leave your email address here or contact me via email. Thanks.

@dxbj986: My problem with high recall and low precision was caused by my faulty data preprocessing. The application I used to open the .csv files (Excel) changed the numbers. I solved it by saving a file that contains only the attack data (I can send it to you if you give me your email address) and appending its contents to the WADI CSV file with a small Python script.
Afterwards, I compared my data to the demo data and it matched. The results were fine then :)
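
In case it is useful to others, a rough sketch of what such a repair script could look like (file names are placeholders; the idea is simply to re-attach the untouched attack rows instead of the Excel-damaged ones):

```python
import pandas as pd

# Placeholders: the normal portion and the untouched attack-only export.
normal_part = pd.read_csv("wadi_normal_rows.csv")
attack_part = pd.read_csv("wadi_attack_rows_untouched.csv")

fixed = pd.concat([normal_part, attack_part], ignore_index=True)
fixed.to_csv("WADI_attackdata_labelled_fixed.csv", index=False)
```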

@peerschuett Thank you for your reply. I have solved this problem based on your earlier comment. The problem was with the dataset: I had only used the wadi_demo.csv file provided by the author, and I resolved it later by using the full WADI dataset.

Hi @d-ailin, thanks again for your fast responses! I used the list.txt you proposed, with 112 features instead of the original 127. The resulting best performances were:

For report="best" a F1 score of 0.5526 with a precision of 0.8547 and recall 0.4072. The parameters were lr=0.001, val_ratio=0.2, decay=0.01 and otherwise similar to my initial comment.

For report="val" a F1 score of 0.5284 with a precision of 0.8562 and recall 0.3821. The parameters were lr=0.001, val_ratio=0.02, decay=0.0 and otherwise similar to my initial comment.

Furthermore, I changed the optimizer from Adam to AdamW.

How did you generate the list with only 112 sensors? Did you choose them based on their contribution to the results?

Thanks for your time and interest, too :)! For these cases, you could also try some other random seeds. For example, I have provided a plausible checkpoint trained with another random seed (e.g., 8) via this link (https://drive.google.com/drive/folders/1aqzvS18iOmfbb56PMZA2LMcWxUBr3txS?usp=sharing); the results on my side are:
For report="best" a F1 score of 0.5842 with a precision of 0.7980 and recall 0.4612.
For report="val" a F1 score of 0.5792 with a precision of 0.9519 and recall 0.4163.
The parameters are similar to your initial comment except the seed, yet I use Adam optimizer. You could load the checkpoint with '-load_model_path' argument.

For the sensor selection: I checked that the removed sensors are those with a very strong distribution shift between the train and test data, or with very sparse spike signals. These sensors are not necessary for detection, but they can affect the stability of the modeling, which is why we removed them. Our algorithm also contains components to mitigate the effect of such cases, e.g., Eq. (12), so you may find that the performance with the original 127 features is similar to that with the provided 112, just potentially less stable. Hope this clarifies :).

Thanks a lot for the checkpoint!
Sadly, I can't load it with the code from this repository, because I get the following error:

RuntimeError: Error(s) in loading state_dict for GDN: Missing key(s) in state_dict: "out_layer.mlp.0.weight", "out_layer.mlp.0.bias", "out_layer.mlp.1.weight", "out_layer.mlp.1.bias", "out_layer.mlp.1.running_mean", "out_layer.mlp.1.running_var", "out_layer.mlp.3.weight", "out_layer.mlp.3.bias". Unexpected key(s) in state_dict: "out_layer.mlp_1.weight", "out_layer.mlp_1.bias", "out_layer.bn.weight", "out_layer.bn.bias", "out_layer.bn.running_mean", "out_layer.bn.running_var", "out_layer.bn.num_batches_tracked", "out_layer.mlp_2.weight", "out_layer.mlp_2.bias".

I guess you changed the out_layer somewhere in the past and didn't commit the change?
For clarification: I tried loading the model with a clean repository state, without any local changes.

@peerschuett So sorry about that. The mismatched weights are there because I had renamed some variables locally; this can be fixed simply by renaming the keys so they align, e.g., renaming "out_layer.mlp_1.xx" to "out_layer.mlp.0.xx". Anyway, I have fixed the issue and tested it with a clean repository, and it seems to work well now; the performance on my side is the same as in my previous comment. I have updated the checkpoint in the link, so feel free to test the updated one as well. Thanks!
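
For reference, a rough sketch of what that manual renaming would look like if done on the checkpoint file itself (the checkpoint path is a placeholder; the bn -> mlp.1 and mlp_2 -> mlp.3 correspondences are inferred from the error message above, and the checkpoint is assumed to be a plain state_dict):

```python
import torch

state = torch.load("pretrained_wadi.pt", map_location="cpu")

# Map old checkpoint key prefixes to the current GDN layer names.
rename = {
    "out_layer.mlp_1.": "out_layer.mlp.0.",
    "out_layer.bn.":    "out_layer.mlp.1.",
    "out_layer.mlp_2.": "out_layer.mlp.3.",
}

fixed = {}
for key, value in state.items():
    for old, new in rename.items():
        if key.startswith(old):
            key = new + key[len(old):]
            break
    fixed[key] = value

torch.save(fixed, "pretrained_wadi_fixed.pt")  # then load via -load_model_path
```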

Thanks for updating the weights, @d-ailin! I was able to load them into the model successfully.
Sadly, I still get worse results: for report="best", an F1 score of 0.5468 with a precision of 0.7251 and a recall of 0.4393.
The only reasons I can think of are faulty data preprocessing on my side or a random seed that isn't identical across our machines.
I will experiment a bit more :D And again, thanks a lot for your fast responses and willingness to help!

Sadly, I still can't reproduce the numbers and don't know why. I compared my data to the CSV demo data and they match. Maybe the random seed is the culprit, but a difference of 0.04 in the F1 score is quite big for that. Do you have any ideas what else I could try?

I guess it is probably due to some inconsistency in the preprocessed dataset, e.g. the labeling, since the "best" performance doesn't depend on the random seed and the CSV demo data only covers the very beginning of the period. I may not be able to share the whole processed dataset, as the data is owned by iTrust and they may have restrictions on sharing, but I could still share some of the other processed parts with you to help with checking. Could you leave your email address here, or contact me via ailin@u.nus.edu if that is convenient? Thanks!

Thanks, I wrote you an email!

@d-ailin, thank you for your excellent open-source work. I see you said you used a processed WADI dataset. Could you share the Google Drive link with the feature list used for WADI together with the processed data?