numenta/NAB

Sweeper / scorer results do not seem to match when running NAB

michellecindy opened this issue · 13 comments

I want to use NAB to run some experiments for my thesis, but when I run the latest commit from GitHub (before the upgrade to Python 3), I get very different results. I was digging into it and I think I found where it is going wrong: it seems to be happening because of the Sweeper. I am not sure what exactly has changed since the Sweeper function was added in January 2019, but I think something might now be going wrong in the scoring step.

The main issue is that the results show a score of -0.11 (standard profile) for every true negative, which is obviously wrong...

When I run the version at commit 53586c4, from November 2016, I do get correct results that match those in the committed result files.

Also, when trying to use the latest committed version from October this year, after the upgrade to Python 3, I cannot even get it to run because of dependency issues.

Just wanted to let you know in case someone wants to look into it.

Thank you for the report, @michellecindy.

The main issue is that the results show a score of -0.11 (standard profile) for every true negative, which is obviously wrong...

I want to try and replicate this with the least effort possible. Do you see this problem for every detector, or just the nupic detectors?

when I run the latest commit from GitHub (before the upgrade to Python 3), I get very different results

So I will try to replicate with this commit in Python 2, correct? 4346b67

As a part of the move to Python 3, I ran all the detectors through NAB again in Python 3, compared the results against the existing scores, and they looked good. But I did not check out and run the latest Python 2 NAB code at 4346b67. I will assume you are correct that 53586c4 is a good SHA, and I'll see if I can replicate your -0.11 score locally.

Python 3, I cannot even get it to run because of dependency issues.

Are you trying to use the HTM detectors in Python 3? That won't work. They must be run in Python 2, and we are working on the Dockerfiles to support that. As soon as they are ready we will release a new NAB version.

So I will try to replicate with this commit in Python 2, correct? 4346b67

I don't think I am going to do this now, because I realized that you probably just picked a SHA far enough back and tested it. You suggested sweep scoring had something to do with this, so I will target the changes that occurred in #328 and run NAB before / after.

@michellecindy Can you help out? I am getting the same results I got when I validated #328, and I don't see the -0.11 results for true negatives. Can you check out 4346b67 and run python run.py --score --normalize? It should not take too long. Then look at the final_results.json file so we can compare. Mine contains only the insignificant differences I noted in #328 (comment). Do you see different results? If so, please describe your Python environment and operating system / container. Thanks.
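
If it helps, here is a rough way to compare two copies of final_results.json side by side. This is only a sketch: the two file names are placeholders for wherever you save each copy, and it assumes the file maps each detector to its per-profile scores.

    import json

    # Rough comparison sketch (not part of NAB). Assumes final_results.json
    # maps detector names to {profile: score}. The two paths below are just
    # placeholders for the copies being compared.
    TOLERANCE = 1e-6

    with open("final_results_local.json") as f:
        local = json.load(f)
    with open("final_results_repo.json") as f:
        repo = json.load(f)

    for detector in sorted(set(local) | set(repo)):
        local_scores = local.get(detector, {})
        repo_scores = repo.get(detector, {})
        for profile in sorted(set(local_scores) | set(repo_scores)):
            a = local_scores.get(profile)
            b = repo_scores.get(profile)
            if a is None or b is None or abs(a - b) > TOLERANCE:
                print(detector, profile, a, b)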

Just to double-check, I also ran at 53586c4 (the one you ran and said was good) and at 4346b67 (the last Python 2 SHA). Both of them looked good to me (only minor number-rounding differences).

Are you actually running the detectors? The problem might not be with the scoring at all. Perhaps something is not right with your detector runtime. Please tell me what detectors are returning the suspicious results.

The CSV files you're referring to are intermediary files, not the final scores. If you were showing that different final results were appearing in final_results.json after running python run.py -d random --score --normalize, I would be concerned.

I was digging into it and I think I found where it is going wrong: it seems to be happening because of the Sweeper.

Why do you think this?

Yes, so if I understand correctly, the final score for a given detector is based on these intermediary scores? I think something is going wrong, because if I run:
-d random --detect --optimize --score --windowsFile labels/combined_windows_tiny.json
and look at the intermediary scores for the files 'random_nyc_taxi.csv' and 'random_rogue_agent_key_hold.csv', they show -0.22, -0.11, -0.11 (using the default thresholds) for every row, which means all of these rows are scored as if they were false positives.
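
If I recall the default profiles correctly, those three values line up with the false-positive weights of the three application profiles, which is why it reads as if every row is being penalized as a false positive; a true negative should contribute 0. A minimal sketch to check the weights locally, assuming the repository keeps them in config/profiles.json with a CostMatrix per profile (adjust the path and key names if your checkout differs):

    import json

    # Minimal check (assumes NAB stores the application profiles in
    # config/profiles.json with a CostMatrix per profile; adjust the path and
    # key names if your checkout differs). A row counted as a false positive
    # far from any window is penalized by roughly -fpWeight, while a true
    # negative should contribute 0.
    with open("config/profiles.json") as f:
        profiles = json.load(f)

    for name, profile in profiles.items():
        print(name, "fpWeight =", profile["CostMatrix"]["fpWeight"])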

This is especially confusing since the optimizer found the 'optimal' threshold in this case to be 0.99, and almost all of the anomaly scores for these rows are below that, which means the majority of the rows should be true negatives, not false positives.

I think it might be in the calcSweepScore method, but I am not sure.
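
To make the symptom easy to reproduce, this is roughly the check I would run against one of those intermediary files. It is only a sketch: the results path and column names (an anomaly_score column plus a per-profile score column like S(t)_standard) are assumptions about the scored CSV layout, so adjust them to whatever your files actually contain.

    import pandas as pd

    # Rough diagnostic sketch, not NAB code. The path and column names below
    # are assumptions about the scored results CSV; adjust them as needed.
    THRESHOLD = 0.99  # the "optimal" threshold reported by the optimizer above

    df = pd.read_csv("results/random/realKnownCause/random_nyc_taxi.csv")
    below = df[df["anomaly_score"] < THRESHOLD]

    # Rows below the threshold are not detections, so outside an anomaly window
    # they should score 0 (true negatives), not carry a false-positive penalty.
    penalized = below[below["S(t)_standard"] < 0]
    print("rows below threshold:", len(below))
    print("rows below threshold with a negative score:", len(penalized))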

So @michellecindy, at this point I'm not sure what you have reported is actually a problem with the scoring, because you have not actually run the detectors and updated the final scores. We haven't been able to successfully run the command you gave above. If you run a detector, score it, update the final results, and there is a discrepancy at that point, I would be concerned. If you can show that, please let us know!

If what you are saying is correct, then our final results might be wrong. Is that true? I am just trying to understand the urgency of this issue before committing my time to it.

@michellecindy BTW thank you for your patience. I am not very familiar with this codebase.

I checked out two SHAs, ran the command below at each, and compared the resulting final_results.json files:

  • 47a3452 (before sweep code was added)
  • f83024b (after sweep code was added)

Before each run, I removed all the previous result files. The command I ran was:

python run.py -d random --detect --score --optimize --normalize

The pertinent diff between the final_results.json files is:

     "random": {
-        "reward_low_FN_rate": 27.57489083329448,
-        "reward_low_FP_rate": 7.480034938338046,
-        "standard": 17.655439698217585
+        "reward_low_FN_rate": 27.57489083333333,
+        "reward_low_FP_rate": 7.4800349383189655,
+        "standard": 17.655439698232758
     },

The differences between these values, for this and the other detectors, were negligible enough that I did not even bother to update them in #328 (comment). I also just ran the same command on the current tip of master, and I get exactly the same random scoring.

I am still not convinced this is actually a scoring bug, but I may still be misunderstanding you. I would be very concerned if you were running NAB in any way and finding significantly different values in results/final_results.json. I cannot replicate this with the random detector running the complete NAB detect/score/optimize/normalize as I showed above.

I am closing this, but I will reopen it if you can show me how to replicate a discrepancy in the final scores.