the score may not reflect the tools' performance
NEUZhangy opened this issue · 15 comments
Hi, I think using FPR may overlook some cases. For example, if a project includes a total of 1005 cases (1000 secure usages and 5 vulnerable usages), and a tool reports 45 vulnerabilities with 5 TPs and 40 FPs, then the TPR is high and the FPR is low, so the score would suggest this is ideal detection. But in fact the precision is pretty low in such a case. Shouldn't we consider it a poor detection?
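Here is a minimal sketch of the arithmetic I mean (in Python; treating the score as TPR minus FPR is my assumption here):

```python
# Assumed counts from the example above:
# 1005 cases total: 1000 secure, 5 vulnerable; the tool flags 45 of them.
TP, FP = 5, 40
FN = 5 - TP            # all 5 vulnerable cases were found -> 0
TN = 1000 - FP         # secure cases not flagged -> 960

tpr = TP / (TP + FN)           # 1.00 -> 100%
fpr = FP / (FP + TN)           # 0.04 -> 4%
precision = TP / (TP + FP)     # ~0.11 -> 11%
score = tpr - fpr              # 0.96, assuming score = TPR - FPR

print(f"TPR={tpr:.0%}  FPR={fpr:.0%}  precision={precision:.0%}  score={score:.0%}")
# TPR=100%  FPR=4%  precision=11%  score=96%
```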
I don't understand your question. What do you mean by security usage vs. vulnerability usage? You mean 1000 False Positive test cases and 5 True Positive test cases, in your example? If that is what you mean, then yes, that would be bad. But we never intend to produce a Benchmark with such an imbalance of true positives vs. false positives. We strive to get them to be about even, so this scenario won't occur.
what would be nice to reflect in the score (or maybe another score) is that tools with a high TPR but also a high FPR have a negative impact on the software development life cycle, due to the time people will lose filtering the tool's lies out of its truths
what I mean is that:
tool-1: 100 TPR, 80 FPR = score 20
tool-2: 20 TPR, 0 FPR = score 20
tool-2 should rank higher somehow IMHO
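a minimal sketch, assuming the score is simply TPR minus FPR (the "truths - lies" idea further down):

```python
# Both tools tie at 20 even though their precision profiles differ a lot.
def score(tpr, fpr):
    return tpr - fpr

print("tool-1:", score(100, 80))  # 20
print("tool-2:", score(20, 0))    # 20
```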
You might prefer tool 2, but others that REALLY care about security and don't mind wading through lots of False Positives might prefer tool 1. The Benchmark project doesn't decide that for users. You have to drill into the scorecards per tool to see if it is a 100-80 tool or a 20-0 tool, and then YOU get to decide which tool you prefer. The details are all there in the individual scorecards.
I strongly agree with you about users' freedom of choice and the scope of the Benchmark project
It's just that if we take your argument to the extreme, we could design a tool that says EVERYTHING is vulnerable. Then those people who don't mind wading through lots of false positives and REALLY (?) care about security would have to manually review the entire source code anyway, which defeats the purpose of using automated tools, and automated tools are the whole scope of the Benchmark.
the goal of the Benchmark project is to validate AUTOMATED tools, and tools that require humans in the process (tool-1) are less useful than tool-2. The score does not reflect that today (the AUTOMATED attribute of the tools)
Well, one easy change would be to show the math on the home page scorecard. For example, rather than just showing:
Toolname vX (20%) - it could instead show something like Toolname vX (20% [X - Y])
Using your initial numbers, something like:
Tool-1 vX (20% [100 - 80])
Tool-2 vX (20% [20 - 0])
A bit ugly in my opinion, but provides more information right up front.
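A rough sketch of how such a label could be built; the function name and signature below are hypothetical, not the actual scorecard generator code:

```python
# Hypothetical helper that formats "Toolname vX (score% [TPR - FPR])".
def label(name, version, tpr, fpr):
    return f"{name} {version} ({tpr - fpr}% [{tpr} - {fpr}])"

print(label("Tool-1", "vX", 100, 80))  # Tool-1 vX (20% [100 - 80])
print(label("Tool-2", "vX", 20, 0))    # Tool-2 vX (20% [20 - 0])
```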
FYI - The scorecard generator has been updated to include math like the above now on the charts on every page.
Maybe easier for human beings:

| name | version | score* | truths | lies |
| ----------------- | ------- | ------ | ------ | ---- |
| coolest-tool-ever | 3.2 | 20% | 100% | 80% |

*score = truths - lies
Not a lot of room on the pictorial scorecard for that. This information is actually right there on the same page in the table below the chart. See the table: Summary of Results by Tool
Isn't that sufficient?
I'm curious: Is there a specific reason to use TPR/FPR/ROC instead of precision/recall/accuracy?
According to the following post,
https://towardsdatascience.com/what-metrics-should-we-use-on-imbalanced-data-set-precision-recall-roc-e2e79252aeba
We care more about the accuracy of vulnerability detection than about the accuracy of labeling benign code. Right?
We decided to use TPR and FPR because those terms are far more familiar and intuitive to the Benchmark audience. I've read the definitions of precision/recall/etc. dozens of times, and I can never tell you what they mean an hour later, as they don't intuitively mean anything to me. I believe the medical community is far more familiar with and used to these terms, but I don't think our community is.
Precision and recall are not that far away from TPR and FPR, e.g., TPR = TP/(TP+FN) = Recall
I think my major concern is about FPR = FP/(FP + TN). Ideally, a tool should have FPR=0 meaning that it does not have any false positive. However, if I build a naive tool that reports nothing, I can always get FPR=0 without difficulty. When we apply a vulnerability detection tool to any software program, it is quite likely that most code is benign and TN can be very large. In that case, even though two tools have significantly different FP counts, their measured FPR values will be very similar to each other because FP << TN. In such scenarios, FPR seems not quite helpful to reflect the effectiveness of tools.
The missing piece in TPR and FPR is precision = TP/(TP+FP), which measures how precise the tool-generated reports are. Certainly, we would like a tool that has both high precision (few FPs) and high recall (few FNs). Therefore, it seems to me that precision and recall are more effective measures than TPR and FPR. By the way, as a researcher in the CS community, I feel that precision and recall are widely used in many subareas of CS. Since we have recently been using the OWASP Benchmark to evaluate vulnerability detection tools, we are just curious why TPR and FPR were chosen.
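A quick sketch of that effect, with made-up counts purely for illustration:

```python
# With a huge benign (TN) population, two tools with very different FP
# counts end up with nearly identical FPR, while precision still
# separates them clearly.
def fpr(fp, tn):
    return fp / (fp + tn)

def precision(tp, fp):
    return tp / (tp + fp)

TN = 1_000_000   # most scanned code is benign
TP = 50

for FP in (10, 1000):
    print(f"FP={FP:>4}: FPR={fpr(FP, TN):.4%}  precision={precision(TP, FP):.0%}")
# FP=  10: FPR=0.0010%  precision=83%
# FP=1000: FPR=0.0999%  precision=5%
```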
I really like precision and recall: tool-1 reports 100% relevant results with 30% completeness
Very intuitive to me, while showing key metrics that make it easy to compare automated tools:
the tool only reports 30% of the existing vulns to me, but 100% of the vulns it reports are real
awesome!
I found the "reporting format" might have some misleading information, as @mengna152173 mentioned, TPR = TP / (TP+FN), which should be Recall's definition instead of precision.
I think someone else pointed this out to me and I hadn't fixed it yet. Thanks for letting me know. I'll try to get that fixed at least.
@NEUZhangy - you are correct. What you referenced should be Recall, not Precision. And this is now fixed on the OWASP site. I checked and the correct word is used in the Benchmark scorecard generator, it was just wrong on the wiki page.
I think this discussion was interesting, and 1 minor fix was made. But there is nothing else to do here, so closing this.