A collection of sample balancing tools based on variations of SMOTE
Imbalanced datasets, in which a category or categories of interest are rare compared to the majority of cases, can make it difficult to obtain a useful predictive model. For example, when modeling the probability of a rare disease or a rare type of fraud with a technique like logistic regression, simply classifying all or almost all cases as not having the disease may give the best prediction in terms of the error rate. Simple overall predictive accuracy is therefore not an appropriate measure in these cases. The statistical assumptions of the model are not violated, assuming that the estimation sample selection is not biased, so inference is still possible as long as the model is not misspecified, but the predictions are not useful. A TREES model can take misclassification costs into account, but LOGISTIC REGRESSION, Neural Nets, and similar procedures cannot do this directly.
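To make the point about the error rate concrete, here is a small arithmetic sketch in Python. The counts are hypothetical and chosen only for illustration; they do not come from any dataset discussed here.

```python
# Hypothetical counts: 1,000 cases, 10 of which belong to the rare class.
n_total = 1000
n_rare = 10

# A "model" that predicts the majority class for every case.
true_negatives = n_total - n_rare   # every common case is classified correctly
true_positives = 0                  # no rare case is ever found

accuracy = (true_negatives + true_positives) / n_total
recall_rare = true_positives / n_rare

print(f"accuracy = {accuracy:.2%}")
print(f"recall on the rare class = {recall_rare:.2%}")
```

The overall accuracy looks excellent, yet the model never identifies a single rare case, which is why a measure that looks at the rare class separately is more informative.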
While for logistic regression or discriminant analysis one can vary the cutoff probabilities to reflect the possibly larger cost of underpredicting the rare events, this does not affect the estimation process. One could, alternatively, assign case weights, giving the rare cases a larger weight. For example, you could weight cases in each class in inverse proportion to the class size. This will generally produce a model that makes more predictions of the rare events, but the model may still underperform as it is not truly tuned to find the rare cases. Some form of importance weighting reflecting the cost of misclassification errors can also be used to improve the results.
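The inverse-proportion weighting mentioned above can be sketched as follows. This is an illustration in plain Python, not part of STATS IMBALANCED; the function name `inverse_class_weights` and the scaling convention (weights sum to the number of cases) are assumptions made for the example.

```python
from collections import Counter

def inverse_class_weights(labels):
    """Return one weight per class, proportional to 1 / class size.

    Assumed scaling: the weighted case total equals the unweighted total,
    so every class contributes the same weighted count.
    """
    counts = Counter(labels)
    n_cases = len(labels)
    n_classes = len(counts)
    return {cls: n_cases / (n_classes * size) for cls, size in counts.items()}

# Hypothetical target variable: 950 common cases (0) and 50 rare cases (1).
labels = [0] * 950 + [1] * 50
weights = inverse_class_weights(labels)
print(weights)
```

With these weights each class has the same weighted total (500 here). In SPSS Statistics, a variable holding such weights could then be used with WEIGHT BY before estimating the model.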
STATS IMBALANCED produces a dataset that is more balanced than its input so that when models are estimated on it, the predictions for the rare events perform better even though the predictions are biased. The methods include variations on the Synthetic Minority Oversampling Technique (SMOTE) algorithm, undersampling methods, and combinations of both. Together these are referred to as resampling. These methods improve the balance of the dataset with respect to the target or dependent variable.
Since the new dataset is to some degree artificial, it should not be used for inference. One would generally estimate (train) the model on a training sample and then test it on a holdout sample. Since you can partition the dataset into training and test samples using the standard methods in SPSS Statistics and balance the result, that process is not performed by this procedure.
Here are a few references on SMOTE and similar techniques.
STATS IMBALANCED
DEP = dependent (target) variable*
INDEP = independent variables*
DATASET = dataset name for output dataset*
METHOD = RANDOM or BORDERLINESMOTE or SMOTE or SMOTENC or SMOTEN or
SVMSMOTE or ADASYN or KMEANSSMOTE or
CLUSTERCENTROIDS or RANDOMUNDERSAMPLER or ONESIDEDSELECTION or
EDITEDNEARESTNEIGHBORS or ALLKNN or
SMOTEENN*
(one of the following)
STRATEGY = NOTMAJORITY** or MINORITY or NOTMINORITY or ALL
STRATEGYVAL = number
STRATEGYLIST = list of categories and list of counts
/OPTIONS
ALLOWMINORITY=NO** or YES
BORDERLINEKIND = TYPE1** or TYPE2
KINDSEL = ALLN** or MODE
KNEIGHBORS = integer
MNEIGHBORS = integer
NNEIGHBORS = integer
OUTSTEP = number
REPLACEMENT = NO** or YES
SEED = integer
SHRINKAGE = number
SUMMARIES = NO** or YES
TARGETFREQ = YES** or NO
VOTING = SOFT** or HARD
* Required
** Default
STATS IMBALANCED /HELP displays this help and does nothing else.
STATS IMBALANCED DATASET=z DEP=minority
INDEP=salary salbegin jobtime prevexp
STRATEGY=NOTMAJORITY METHOD=SVMSMOTE
/OPTIONS TARGETFREQ=YES.
STATS IMBALANCED DATASET=z DEP=minority
INDEP=bdate educ jobcat salary salbegin jobtime prevexp
STRATEGY=MINORITY METHOD=RANDOM.
For all methods, there are several choices available for how the cases are sampled and new cases generated. The resampling strategies are as follows.
- STRATEGYVAL: A fractional number: the desired ratio of the number of cases in the minority class to the number of cases in the majority class after resampling. For example, a value of 1 means minority and majority counts are to be equal. This is only applicable when there are exactly two classes.
- STRATEGY:
  - MINORITY: resample only the minority class.
  - NOTMINORITY: resample all classes but the minority class.
  - NOTMAJORITY: resample all classes but the majority class.
  - ALL: resample all classes.
- STRATEGYLIST: Categories and counts: resample according to a list of category values and counts of the target number of cases. When oversampling, the case count for a category must be at least as large as the number of cases in the input category.
List all the categories first, using quotes for text values where needed, followed by a list of counts. For example,
STRATEGYLIST 1 2 3 100 100 100
requests 100 cases in each of categories 1, 2, and 3. The category values are case sensitive for strings.
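The following Python sketch shows one way to read the three strategy specifications as target counts when oversampling. It is an illustration only, not the procedure's code. The function names are hypothetical, and it assumes that the keyword strategies raise each selected class to the size of the majority class and that a ratio is truncated to a whole number of cases.

```python
def parse_strategy_list(values):
    """Split [cat1, ..., catK, n1, ..., nK] into {category: count}."""
    half, odd = divmod(len(values), 2)
    if odd:
        raise ValueError("expected as many counts as categories")
    return dict(zip(values[:half], (int(n) for n in values[half:])))

def oversampling_targets(counts, strategy):
    """Target number of cases per class after oversampling.

    counts:   {class value: current number of cases}
    strategy: a STRATEGY keyword, a STRATEGYVAL ratio, or a
              {class value: target count} dict from STRATEGYLIST.
    """
    majority = max(counts, key=counts.get)
    minority = min(counts, key=counts.get)

    if isinstance(strategy, dict):
        requested = strategy
    elif isinstance(strategy, str):
        selected = {
            "MINORITY": [minority],
            "NOTMINORITY": [c for c in counts if c != minority],
            "NOTMAJORITY": [c for c in counts if c != majority],
            "ALL": list(counts),
        }[strategy.upper()]
        # Assumption: selected classes are raised to the majority size.
        requested = {c: counts[majority] for c in selected}
    else:
        if len(counts) != 2:
            raise ValueError("a ratio applies only with exactly two classes")
        requested = {minority: int(strategy * counts[majority])}

    targets = dict(counts)
    for cls, target in requested.items():
        if target < counts[cls]:
            raise ValueError(f"target for {cls!r} is below its current count")
        targets[cls] = target
    return targets

counts = {1: 900, 2: 80, 3: 20}          # hypothetical class sizes
print(oversampling_targets(counts, "NOTMAJORITY"))
print(oversampling_targets(counts, "MINORITY"))
print(oversampling_targets({0: 900, 1: 90}, 0.5))
print(oversampling_targets(counts, parse_strategy_list([2, 3, 200, 200])))
```

As noted elsewhere in this help, the actual counts produced by the procedure might not match the targets exactly, depending on the method and the data.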
Descriptions of the methods are adapted from the documentation for the Imbalanced-learn Python library. Details on the methods can also be found there, in its Methods reference and its User Guide.
Sample balance can be improved by adding cases like ones already in the small group(s) (oversampling) or by removing cases optimally from the majority group(s) (undersampling). These methods can even be combined. Many of the methods restrict the types of variables that can be used; the restrictions are noted below. Some methods cannot be used with string variables except for the target (dependent) variable; some cannot be used with categorical variables, and some can only be used when all variables are categorical. In particular, SMOTE-NC and SMOTE-N handle categorical variables. You can use the SPSS AUTORECODE command to make numerical equivalents to string variables.
Some methods do not accept missing values, whether user or system missing. If they occur, those cases will be ignored for those methods. Missing values in the target variable, however, are never permitted and must be excluded before running this procedure.
Case weights and split files are not supported in this procedure. Note also that it can happen that the requested sampling cannot be achieved due to the properties of the dataset. An error message will be issued in that situation. Target counts and proportions might not be exactly achieved, depending on the nature of the data.
SEED Seed for Random Numbers: This procedure does not use the SPSS random number generators. It uses its own generators. You can specify the starting seed as an integer value if you want a reproducible result. If no seed is specified, the starting value will be, well, random.
TARGETFREQ Display frequencies for target variable: Check to display frequencies for the target variable in the new dataset.
SUMMARIES Display summaries for independent variables: Check to display summary statistics for the non-target variables in the new dataset by the target categories according to the measurement levels.
- RANDOM Random Oversample: oversample the minority class(es) by picking cases at random with replacement.
- BORDERLINESMOTE Borderline SMOTE: Variant of the original SMOTE algorithm. Borderline cases will be detected and used to generate new synthetic cases. (no missing values)
- SMOTE SMOTE: Synthetic Minority Over-sampling Technique (no strings, no missing values)
- SMOTENC SMOTE-NC: Synthetic Minority Over-sampling Technique for nominal and continuous variables. Unlike SMOTE, it is used for a dataset containing both numerical and categorical features: It requires at least one scale variable and one categorical variable. (no strings, no missing values)
- SMOTEN SMOTE-N: Synthetic Minority Over-sampling Technique for nominal variables. It expects that all the variables are categorical.
- SVMSMOTE SVM SMOTE: Variant of the SMOTE algorithm that uses an SVM algorithm to detect cases to use for generating new synthetic cases. (no strings, no missing values)
- KMEANSSMOTE K Means SMOTE: Apply KMeans clustering before oversampling using SMOTE. (no strings, no missing values)
- ADASYN ADASYN: Oversample using the Adaptive Synthetic (ADASYN) algorithm. This method is similar to SMOTE but generates different numbers of cases depending on an estimate of the local distribution of the class to be oversampled. (no strings, no missing values)
- SMOTEENN SMOTE with ENN: Combine over- and under-sampling using SMOTE and Edited Nearest Neighbors. (no strings, no missing values)
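The interpolation step that the SMOTE variants above share can be sketched in plain Python as follows. This is a minimal illustration for numeric variables only, not the Imbalanced-learn implementation; the function name `smote_sketch`, its arguments, and the sample points are hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=None):
    """Generate n_new synthetic cases from numeric minority cases.

    Each synthetic case lies on the line segment between a randomly chosen
    minority case and one of its k nearest minority neighbors.
    Requires at least two minority cases.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # Other minority cases, ordered by Euclidean distance to the base.
        others = [p for i, p in enumerate(minority) if i != base_idx]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic

# Hypothetical minority cases with two scale variables.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]
new_cases = smote_sketch(minority, n_new=6, k=2, seed=1)
for case in new_cases:
    print(case)
```

Because new cases are interpolated between existing ones, they stay within the range of the observed minority cases. The variants differ mainly in which cases are chosen as starting points and how categorical variables are handled.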
Following is a list of method parameters that apply only to some of the methods. All of these parameters have default values, which will be reported in the procedure output in most cases. Parameters that do not apply to the chosen method are simply ignored if specified.
- KNEIGHBORS Neighborhood Size for SMOTE, SMOTE-NC, SMOTE-N, SVM SMOTE, and Borderline SMOTE: The number of nearest neighbors used to define the neighborhood of cases to use to generate the synthetic cases.
- BORDERLINEKIND Borderline SMOTE Type: The type of Borderline SMOTE algorithm to use. Each minority case is classified as (i) noise (all of its nearest neighbors are from a different class), (ii) in danger (at least half, but not all, of its nearest neighbors are from a different class), or (iii) safe (all other cases). The cases in danger are used to generate new cases. With TYPE1, the neighbor used to generate a new case belongs to the same class as the case in danger. With TYPE2, the neighbor can come from any class.
- MNEIGHBORS Borderline SMOTE Neighbors: The number of nearest neighbors used to determine if a minority case is in danger.
- KINDSEL Case Exclusions Strategy for All KNN: Strategy to use in order to exclude cases.
  - If ALLN, all neighbors have to agree with the case being classified for it not to be excluded.
  - If MODE, the majority vote of the neighbors is used in order to exclude a case.
- NNEIGHBORS Neighborhood Size for All KNN: Size of the neighborhood to consider in computing the nearest neighbors.
- ALLOWMINORITY ALLKNN: Majority/Minority Rule: If YES, allows the majority classes to become the minority class without early stopping.
- VOTING Cluster Centroids Voting: Voting strategy to generate the new cases. If HARD, the nearest neighbors of the centroids found using the clustering algorithm are used. If SOFT, the centroids found by the clustering algorithm are used.
- OUTSTEP SVM SMOTE Extrapolation Step Size: The step size when extrapolating.
- REPLACEMENT Random Undersample Sampling Type: Whether the sample is with (YES) or without replacement.
- SHRINKAGE Random Oversample Shrinkage Factor: The shrinkage applied to the covariance matrix when a smoothed bootstrap is generated. If zero, a normal bootstrap will be generated without perturbation. The shrinkage factor will be used for all classes to generate the smoothed bootstrap.
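The noise / in danger / safe classification that Borderline SMOTE relies on (controlled by MNEIGHBORS) can be sketched as below. This is a plain-Python illustration, not the library's code. It assumes the usual Borderline-SMOTE rule, in which a case is in danger when at least half, but not all, of its nearest neighbors belong to another class. The function name and the data are hypothetical.

```python
import math

def borderline_status(case_index, points, labels, m=10):
    """Classify one case as 'noise', 'danger', or 'safe'.

    The decision is based on how many of the case's m nearest neighbors
    (over all classes) have a label different from the case's own label.
    """
    base, own = points[case_index], labels[case_index]
    others = [i for i in range(len(points)) if i != case_index]
    nearest = sorted(others, key=lambda i: math.dist(base, points[i]))[:m]
    n_other = sum(labels[i] != own for i in nearest)
    if n_other == len(nearest):
        return "noise"    # every neighbor belongs to another class
    if n_other >= len(nearest) / 2:
        return "danger"   # on the border: used to generate new cases
    return "safe"

# Hypothetical one-variable data; label 1 is the minority class.
points = [(0.0,), (0.1,), (0.2,), (0.35,), (0.5,), (0.6,), (0.55,)]
labels = [1, 1, 1, 0, 0, 0, 1]

for i in (0, 2, 6):
    print(points[i], borderline_status(i, points, labels, m=2))
```

Only the cases reported as in danger would serve as starting points for synthetic cases; cases flagged as noise or safe are left alone.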
This command uses the Imbalanced-learn Python library. See Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning, 2017.
© Copyright Jon K. Peck, 2023