Minqi824/ADBench

passing ratio information in fit() derived from test-dataset

DhavalRepo18 opened this issue · 3 comments

Shall we avoid passing `ratio=sum(self.data['y_test']) / len(self.data['y_test'])` in `fit()`?

ADBench/run.py

Lines 206 to 207 in f3a9e94

self.clf = self.clf.fit(X_train=self.data['X_train'], y_train=self.data['y_train'],
                        ratio=sum(self.data['y_test']) / len(self.data['y_test']))
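As an illustration of the concern only (a minimal sketch, not a proposed patch; `self.data` is the same dict used in run.py), the ratio could instead be passed in explicitly by the user or derived from the training labels, so that nothing is computed from `y_test` during fit():

# illustrative alternative: no information from the test split leaks into fit()
# (in the semi-supervised setting this may only approximate the true anomaly ratio)
train_ratio = sum(self.data['y_train']) / len(self.data['y_train'])
self.clf = self.clf.fit(X_train=self.data['X_train'], y_train=self.data['y_train'],
                        ratio=train_ratio)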

@Minqi824 @yzhao062

Thank you for raising this problem! Actually, the ratio is only used for hyper-parameter tuning of the unsupervised methods. Although we use the default hyper-parameter settings in the ADBench paper, we additionally provide code for automatically tuning the hyper-parameters based on the labeled anomalies (by constructing an additional validation set). Therefore, the ratio is necessary for calculating the number of normal samples.

For example, if we have 10 labeled anomalies and provide an anomaly ratio of 5%, then [(1 - 5%) / 5%] * 10 = 190 normal samples are required to construct a "subset" of the original dataset for evaluating the unsupervised method. The ratio option is ignored by both the semi-supervised and fully-supervised algorithms and is only present for API consistency by convention.
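A quick sketch of that arithmetic (the variable names here are illustrative, not part of the ADBench API):

n_labeled_anomalies = 10
anomaly_ratio = 0.05  # e.g. 5%
# normal samples needed so that the subset keeps the given anomaly ratio
n_normal = int(n_labeled_anomalies * (1 - anomaly_ratio) / anomaly_ratio)
print(n_normal)  # 190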

We agree that using the ratio calculated from the testing set may cause confusion, and we have removed this default value for the ratio option. Thanks a lot for this kind advice!

@Minqi824 Thanks for the prompt reply. I have a couple of questions:

  1. Does the output of `score_test = self.clf.predict_score(self.data['X_test'])` depend on the ratio?

  2. "ratio is only for the hyper-parameter tuning of the unsupervised methods" --> can you point to the relevant code?

@DhavalRepo18 Thanks again for your advice :)!
For Question 1, the anomaly scores on the testing set do not depend on the ratio, since we use the AUCROC and AUCPR metrics to evaluate AD algorithms, and these metrics do not rely on a specific threshold (or ratio) for calculating the results.
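A minimal sketch of that evaluation (assuming scikit-learn, with `y_test` and `score_test` as produced by the pipeline above):

from sklearn.metrics import roc_auc_score, average_precision_score

# score_test holds continuous anomaly scores, e.g. score_test = clf.predict_score(X_test)
aucroc = roc_auc_score(y_test, score_test)           # ranking-based, no threshold needed
aucpr = average_precision_score(y_test, score_test)  # AUCPR via average precision, also threshold-free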

For Question 2, the corresponding code is as follows:
For the unsupervised methods wrapped in PyOD:

ADBench/baseline/PyOD.py

Lines 91 to 113 in 4040da1

def grid_search(self, X_train, y_train, ratio=None):
    '''
    Implement the grid search for unsupervised models and return the best hyper-parameters.
    The ratio can be the ground-truth anomaly ratio of the input dataset.
    '''
    # set seed
    self.utils.set_seed(self.seed)
    # get the hyper-parameter grid
    param_grid = self.grid_hp(self.model_name)

    if param_grid is not None:
        # indices of normal and abnormal samples
        idx_a = np.where(y_train == 1)[0]
        idx_n = np.where(y_train == 0)[0]
        # subsample normals so that the subset keeps the given anomaly ratio
        idx_n = np.random.choice(idx_n, int((len(idx_a) * (1 - ratio)) / ratio), replace=True)

        idx = np.append(idx_n, idx_a)  # combine
        np.random.shuffle(idx)         # shuffle

        # validation set (with the same anomaly ratio as in the original dataset)
        X_val = X_train[idx]
        y_val = y_train[idx]

For the unsupervised method DAGMM:

def grid_search(self, X_train, y_train, ratio):
    '''
    Implement the grid search for unsupervised models and return the best hyper-parameters.
    The ratio can be the ground-truth anomaly ratio of the input dataset.
    '''
    # set seed
    self.utils.set_seed(self.seed)
    # hyper-parameter grid (n_gmm, default=4)
    param_grid = [4, 6, 8, 10]

    # indices of normal and abnormal samples
    idx_a = np.where(y_train == 1)[0]
    idx_n = np.where(y_train == 0)[0]
    # subsample normals so that the subset keeps the given anomaly ratio
    idx_n = np.random.choice(idx_n, int((len(idx_a) * (1 - ratio)) / ratio), replace=True)

    idx = np.append(idx_n, idx_a)  # combine
    np.random.shuffle(idx)         # shuffle

    # validation set (with the same anomaly ratio as in the original dataset)
    X_val = X_train[idx]
    y_val = y_train[idx]
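For completeness, here is a hedged sketch (not the repository's exact code, and using PyOD's IForest purely as an illustrative model) of how such a validation subset could then drive the search, with X_train, X_val, and y_val as constructed above: each candidate hyper-parameter is fitted on X_train, scored on (X_val, y_val), and the best-scoring value is kept.

from pyod.models.iforest import IForest        # illustrative PyOD model
from sklearn.metrics import roc_auc_score

param_grid = [50, 100, 200]                     # candidate values of n_estimators
best_auc, best_param = -float('inf'), None
for n_estimators in param_grid:
    model = IForest(n_estimators=n_estimators).fit(X_train)      # unsupervised fit
    auc = roc_auc_score(y_val, model.decision_function(X_val))   # threshold-free validation metric
    if auc > best_auc:
        best_auc, best_param = auc, n_estimators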

Although not presented in the paper, we found that using additional labeled anomalies to tune the hyper-parameters of unsupervised AD algorithms slightly improves their performance.