Minqi824/ADBench

passing ratio information in fit() derived from test-dataset

DhavalRepo18 opened this issue · 3 comments

Shall we avoid passing `ratio=sum(self.data['y_test']) / len(self.data['y_test'])` in `fit()`?

ADBench/run.py

Lines 206 to 207 in f3a9e94

self.clf = self.clf.fit(X_train=self.data['X_train'], y_train=self.data['y_train'],
                        ratio=sum(self.data['y_test']) / len(self.data['y_test']))
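As an illustration of the concern only (a minimal sketch, not a proposed patch; `self.data` is the same dict used in run.py), the ratio could instead be passed in explicitly by the user or derived from the training labels, so that nothing is computed from `y_test` during fit():

# illustrative alternative: no information from the test split leaks into fit()
# (in the semi-supervised setting this may only approximate the true anomaly ratio)
train_ratio = sum(self.data['y_train']) / len(self.data['y_train'])
self.clf = self.clf.fit(X_train=self.data['X_train'], y_train=self.data['y_train'],
                        ratio=train_ratio)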

@Minqi824 @yzhao062

Thank you for raising this problem! Actually, the ratio is only used for hyper-parameter tuning of the unsupervised methods. Although we use the default hyper-parameter settings in the ADBench paper, we additionally provide code for automatically tuning the hyper-parameters based on the labeled anomalies (by constructing an additional validation set). Therefore, the ratio is necessary for calculating the number of normal samples.

For example, if we have 10 labeled anomalies and provide an anomaly ratio of 5%, then [(1 - 5%) / 5%] * 10 = 190 normal samples are required to construct a "subset" of the original dataset for evaluating the unsupervised method. The ratio option is ignored by both the semi-supervised and fully-supervised algorithms and is only present for API consistency by convention.
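A quick sketch of that arithmetic (the variable names here are illustrative, not part of the ADBench API):

n_labeled_anomalies = 10
anomaly_ratio = 0.05  # e.g. 5%
# normal samples needed so that the subset keeps the given anomaly ratio
n_normal = int(n_labeled_anomalies * (1 - anomaly_ratio) / anomaly_ratio)
print(n_normal)  # 190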

We agree that using the ratio calculated from the testing set may cause confusion, and we have removed this default value for the ratio option. Thanks a lot for this kind advice!

@Minqi824 Thanks for the prompt reply. I have a couple of questions:

  1. Does the output of `score_test = self.clf.predict_score(self.data['X_test'])` depend on the ratio?

  2. "ratio is only for the hyper-parameter tuning of the unsupervised methods" --> can you point to the relevant code?

@DhavalRepo18 Thanks again for your advice :)!
For Question 1, the anomaly scores on the testing set do not depend on the ratio, since we use the AUCROC and AUCPR metrics to evaluate AD algorithms, and these metrics do not rely on a specific threshold (or ratio) for calculating the results.
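A minimal sketch of that evaluation (assuming scikit-learn, with `y_test` and `score_test` as produced by the pipeline above):

from sklearn.metrics import roc_auc_score, average_precision_score

# score_test holds continuous anomaly scores, e.g. score_test = clf.predict_score(X_test)
aucroc = roc_auc_score(y_test, score_test)           # ranking-based, no threshold needed
aucpr = average_precision_score(y_test, score_test)  # AUCPR via average precision, also threshold-free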

For Question 2, the corresponding code is as follows:
For the unsupervised methods wrapped in PyOD:

ADBench/baseline/PyOD.py

Lines 91 to 113 in 4040da1

def grid_search(self, X_train, y_train, ratio=None):
    '''
    Implement the grid search for unsupervised models and return the best hyper-parameters.
    The ratio can be the ground-truth anomaly ratio of the input dataset.
    '''
    # set seed
    self.utils.set_seed(self.seed)
    # get the hyper-parameter grid
    param_grid = self.grid_hp(self.model_name)

    if param_grid is not None:
        # indices of normal and abnormal samples
        idx_a = np.where(y_train == 1)[0]
        idx_n = np.where(y_train == 0)[0]
        # subsample normals so that the subset keeps the given anomaly ratio
        idx_n = np.random.choice(idx_n, int((len(idx_a) * (1 - ratio)) / ratio), replace=True)

        idx = np.append(idx_n, idx_a)  # combine
        np.random.shuffle(idx)         # shuffle

        # validation set (with the same anomaly ratio as in the original dataset)
        X_val = X_train[idx]
        y_val = y_train[idx]

For the unsupervised method DAGMM:

def grid_search(self, X_train, y_train, ratio):
    '''
    Implement the grid search for unsupervised models and return the best hyper-parameters.
    The ratio can be the ground-truth anomaly ratio of the input dataset.
    '''
    # set seed
    self.utils.set_seed(self.seed)
    # hyper-parameter grid (n_gmm, default=4)
    param_grid = [4, 6, 8, 10]

    # indices of normal and abnormal samples
    idx_a = np.where(y_train == 1)[0]
    idx_n = np.where(y_train == 0)[0]
    # subsample normals so that the subset keeps the given anomaly ratio
    idx_n = np.random.choice(idx_n, int((len(idx_a) * (1 - ratio)) / ratio), replace=True)

    idx = np.append(idx_n, idx_a)  # combine
    np.random.shuffle(idx)         # shuffle

    # validation set (with the same anomaly ratio as in the original dataset)
    X_val = X_train[idx]
    y_val = y_train[idx]
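For completeness, here is a hedged sketch (not the repository's exact code, and using PyOD's IForest purely as an illustrative model) of how such a validation subset could then drive the search, with X_train, X_val, and y_val as constructed above: each candidate hyper-parameter is fitted on X_train, scored on (X_val, y_val), and the best-scoring value is kept.

from pyod.models.iforest import IForest        # illustrative PyOD model
from sklearn.metrics import roc_auc_score

param_grid = [50, 100, 200]                     # candidate values of n_estimators
best_auc, best_param = -float('inf'), None
for n_estimators in param_grid:
    model = IForest(n_estimators=n_estimators).fit(X_train)      # unsupervised fit
    auc = roc_auc_score(y_val, model.decision_function(X_val))   # threshold-free validation metric
    if auc > best_auc:
        best_auc, best_param = auc, n_estimators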

Although not presented in the paper, we found that using additional labeled anomalies to tune the hyper-parameters of unsupervised AD algorithms slightly improves their performance.