Passing ratio information derived from the test dataset to fit()
DhavalRepo18 opened this issue · 3 comments
Shall we avoid passing `ratio=sum(self.data['y_test']) / len(self.data['y_test'])`?
Lines 206 to 207 in f3a9e94
Thank you for raising this problem! The ratio is actually used only for hyper-parameter tuning of the unsupervised methods. Although we use the default hyper-parameter settings in the ADBench paper, we additionally provide code for automatically tuning the hyper-parameters based on the labeled anomalies (which are used to construct an additional validation set). The ratio is therefore needed to calculate the number of normal samples in that set.
For example, if we have 10 labeled anomalies, we need to provide the anomaly ratio (e.g., 5%), so that [(1-5%) / 5%] * 10 = 190 normal samples are drawn to construct a "subset" of the original dataset for evaluating the unsupervised method. The ratio option is ignored by both semi-supervised and fully-supervised algorithms and is present only for API consistency.
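The arithmetic in the example above can be sketched as follows. This is only an illustration of the formula, not the ADBench implementation; the function name `normal_samples_needed` and its parameters are hypothetical.

```python
def normal_samples_needed(n_labeled_anomalies: int, ratio: float) -> int:
    """Number of normal samples to pair with the labeled anomalies so that
    anomalies make up `ratio` of the resulting validation subset.

    Hypothetical helper for illustration; not part of ADBench.
    """
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    # n_normal / n_anomalies = (1 - ratio) / ratio
    return round(n_labeled_anomalies * (1.0 - ratio) / ratio)


n_anomalies = 10
n_normal = normal_samples_needed(n_anomalies, ratio=0.05)
# Anomaly ratio of the resulting subset (10 anomalies + n_normal normals).
subset_ratio = n_anomalies / (n_anomalies + n_normal)
print(n_normal, subset_ratio)
```

With 10 labeled anomalies and a 5% ratio this yields 190 normal samples, i.e. a 200-sample subset.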
We agree that using the ratio calculated from the testing set may cause confusion, and we have removed this default value for the ratio option. Thanks a lot for this kind advice!
@Minqi824 Thanks for the prompt reply. I have a couple of questions:
- Does the output of 'score_test = self.clf.predict_score(self.data['X_test'])' depend on ratio?
- "ratio is only for the hyper-parameter tuning of the unsupervised methods" --> can you point me to the code?
@DhavalRepo18 Thanks again for your advice :) !~
For Question 1, the anomaly scores output on the testing set do not depend on the ratio, since we evaluate AD algorithms with the AUCROC and AUCPR metrics, which do not rely on a specific threshold (or ratio) to calculate the results.
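To illustrate why a ranking metric such as AUCROC needs no threshold or ratio, here is a minimal pure-Python sketch of the pairwise (rank-based) definition of AUCROC. The function `auc_roc` and the toy labels and scores are assumptions made for illustration; ADBench itself may compute the metric differently (for example through a library routine).

```python
def auc_roc(y_true, scores):
    """AUCROC as the fraction of (anomaly, normal) pairs in which the anomaly
    receives the higher score; ties count as half. No threshold is involved.

    Illustrative helper only; not taken from ADBench.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomaly and one normal sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up): 1 = anomaly, 0 = normal.
y_true = [0, 0, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]

auc_original = auc_roc(y_true, scores)
# Rescaling the scores keeps their ordering, so the metric is unchanged.
auc_rescaled = auc_roc(y_true, [10 * s + 3 for s in scores])
print(auc_original, auc_rescaled)
```

Because only the ordering of the scores matters, no contamination value or decision threshold enters the computation.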
For Question 2, the corresponding code is as follows:
For the unsupervised methods wrapped in PyOD:
Lines 91 to 113 in 4040da1
For the unsupervised method DAGMM:
Lines 39 to 60 in 4040da1
Although not presented in the paper, we found that using additional labeled anomalies to tune the hyper-parameters of unsupervised AD algorithms slightly improves their performance.