
Nearest neighbor classifiers written in R


Similarity-based classifiers: nearest neighbor classifiers

Similarity-based classifiers estimate the class label of a test sample based on the similarities between the test sample and a set of labeled training samples, and the pairwise similarities between the training samples.

Similarity functions ρ(x, y) may be asymmetric and may fail to satisfy other mathematical properties required of metrics or inner products. A popular approach to similarity-based classification is to treat the given dissimilarities as distances in some Euclidean space.

Nearest neighbor classifiers take as input a training sample, an object u to be classified, and optionally some parameters of the weight function, and output the label (class) predicted for u. The algorithm sorts the training sample by distance (similarity) to the classified object in ascending order, so training objects with a smaller index i are closer to u. The label of u is found as the argument that maximizes the sum of weight functions:

a(u) = arg max_{y ∈ Y} Σ_{i=1..l} [y_u^(i) = y] · ω(i, u),

where y_u^(i) is the label of the i-th nearest training object to u and [·] is the indicator bracket.

Since all calculations are delayed until u is known, the nearest neighbor classifier is considered a lazy learning method. By varying the weight function, different similarity-based classifiers can be obtained.
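The listings below rely on a metric dst1 and a helper sortByDist that sorts the training sample by distance to u. They are defined elsewhere in the repository; a minimal sketch, assuming a Euclidean metric and a training sample stored as a data frame whose last column is the label factor, could look like this:

# Euclidean distance between two feature vectors
# (an assumption: the repository defines its own metric dst1)
dst1 <- function(a, b) sqrt(sum((a - b)^2))

# sort the training sample by distance to u;
# returns a data frame ordered by ascending distance:
# column 1 - class label, column 2 - distance to u
sortByDist <- function(trainSet, u, rowsNum, varsNum, metric = dst1){
  u <- as.numeric(unlist(u))
  dists <- apply(trainSet[, 1:varsNum], 1, function(x) metric(as.numeric(x), u))
  ordered <- data.frame(label = trainSet[, varsNum + 1], dist = dists)
  ordered[order(ordered$dist), ]
}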

Some of them are examined in this work:

  1. k nearest neighbor algorithm
  2. k weighted nearest neighbor algorithm
  3. Parzen window algorithm
  4. Parzen window algorithm with variable window width
  5. potential function algorithm

The algorithms are tested on the standard R iris data set (Fisher's Iris data set). Plots are presented in the petal.length, petal.width coordinate space.

Data compression methods are addressed using the STOLP algorithm as an example. At the end, the presented algorithms are compared in terms of generalization performance.

k nearest neighbor algorithm

A parameter k is introduced and the weight function looks like this: ω(i, u) = [i <= k]. It returns 1 if i is less than or equal to k and 0 otherwise; in other words, u gets the label of the majority among its k nearest neighbors:

a(u) = arg max_{y ∈ Y} Σ_{i=1..k} [y_u^(i) = y]

Listing of the kNNClassifier function written in R:

# nearest neighbor classifiers, main params:
# trainSet - data frame with rows representing training objects
# each column represents some numerical feature, last column is label factor
# u - object to classify
# returns c(label, margin)
kNNClassifier <- function(trainSet, u, metric = dst1, k){
  rowsNum <- dim(trainSet)[1]; varsNum <- dim(trainSet)[2]-1
  labels = levels(trainSet[rowsNum, varsNum+1])
  orderedDist <- sortByDist(trainSet, u, rowsNum, varsNum, metric)
  kNNeighbors <- orderedDist[1:k,1]   # labels of the k nearest neighbors
  countTable <- table(kNNeighbors)    # votes per class
  maxLabelIdx = which.max(countTable)
  # second element: vote margin between the winning class and the runner-up
  return(c(labels[maxLabelIdx], abs(max(countTable)-max(countTable[-maxLabelIdx]))))
}
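For example, assuming the helpers sketched above, a single iris flower can be classified by petal length and width like this (row 15 and k = 6 are just illustrative choices):

train <- iris[-15, 3:5]                      # petal.length, petal.width, Species
kNNClassifier(train, iris[15, 3:4], k = 6)   # returns the predicted label and the vote margin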

The optimal k for a particular sample is found through empirical risk minimization using LOO cross-validation.

LOO (leave-one-out) is a procedure for the empirical evaluation of classification quality. For every element of the sample, the sample is split into a test set consisting of that single element and a training set comprising all the other elements. The procedure outputs the number of errors (wrong classifications) divided by the total number of tests (the number of elements in the sample).

Here is the R implementation:

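# LOO main params:
# classifier - classifier function to evaluate
# classifierArgs - list of extra arguments passed to the classifier
# dataSet - data frame with rows representing objects, last column is the label factor
# paramName, paramRange - name and candidate values of the parameter being tuned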
LOO <- function(classifier, classifierArgs, dataSet, paramName, paramRange){
  rows <- dim(dataSet)[1]
  cols <- dim(dataSet)[2]-1
  risks <- vector('numeric', length(paramRange))
  minErr <- rows
  bestParam = 0; j=1; paramIdx = 1
  for(param in paramRange){
    err = 0
    for(i in 1:rows){
      currentArgs = c(list(dataSet[-i, ], dataSet[i, 1:cols]),classifierArgs)
      currentArgs[paramName] = param
      class <- do.call(classifier, currentArgs)[1]
      if(class != dataSet[i,cols+1])
        err <- err+1
    }
    if(err < minErr){
      minErr = err
      bestParam = param
      paramIdx = j
    }
    risks[j] = err/rows
    j = j+1
    message(paste("current ", paramName, " = ", sep=''), param, " out of ", paramRange[length(paramRange)])
  }
  message(paste("best ", paramName, ": ", sep=''), bestParam)
  message("errors made: ", minErr)
  message("empirical risk: ", minErr/rows)
  plot(paramRange, risks, xlab=paramName, ylab='LOO', col='red', type='l')
  points(bestParam, risks[paramIdx], bg='red', col='red', pch=21)
  text(bestParam, risks[paramIdx], paste(paramName,"=",bestParam), pos=3)
  return(bestParam)
}
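A call like the following (the range 1:30 is just an illustrative choice) reproduces the search for k on the two petal features:

bestK <- LOO(kNNClassifier, list(), iris[, 3:5], "k", 1:30)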

Here is the optimal k for the iris data set (only sepal length and width considered), found by LOO:

[figure: LOOkNN2]

Classification map for arbitrary objects (k = 6):

[figure: kNNk6]

And here is the optimal k with all features considered:

[figure: LOOkNN4]

k weighted nearest neighbor algorithm

a(u) = arg max_{y ∈ Y} Σ_{i=1..k} [y_u^(i) = y] · ω(i)

ω(i) is a monotonically decreasing sequence of real-valued weights specifying the contribution of the i-th neighbor to the classification of u.

Such a sequence can be obtained as a geometric sequence with common ratio q < 1, where the optimal q is found by minimizing the empirical risk with LOO.

[figures: kWNNLOO2, kWNNq06k6]
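A weighted variant can be obtained from kNNClassifier with a small change: instead of counting votes, the geometric weights q^i are summed per class. This is a sketch in the same style, assuming the helpers above; the repository's own implementation may differ in details.

kwNNClassifier <- function(trainSet, u, metric = dst1, k, q){
  rowsNum <- dim(trainSet)[1]; varsNum <- dim(trainSet)[2]-1
  labels <- levels(trainSet[rowsNum, varsNum+1])
  orderedDist <- sortByDist(trainSet, u, rowsNum, varsNum, metric)
  kNNeighbors <- orderedDist[1:k, 1]
  weights <- q^(1:k)                                   # geometric weights, q < 1
  votes <- sapply(labels, function(l) sum(weights[kNNeighbors == l]))
  maxLabelIdx <- which.max(votes)
  # second element: weighted vote margin between the winner and the runner-up
  return(c(labels[maxLabelIdx], max(votes) - max(votes[-maxLabelIdx])))
}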

Parzen window algorithm

Let's define ω(i, u) as a function of distance rather than neighbor rank:

ω(i, u) = K(ρ(u, x_u^(i)) / h),

where K(z) is a kernel function, non-increasing on [0, ∞). Then our metric classifier looks like this:

a(u) = arg max_{y ∈ Y} Σ_{i=1..l} [y_u^(i) = y] · K(ρ(u, x_u^(i)) / h)

This is how the kernel function graphs look: [figure: kernels]

Here are some of them implemented in R:

ker1 <- (function(r) max(0.75*(1-r*r),0)) # epanechnikov
ker2 <- (function(r) ifelse(abs(r) <= 1, 0.9375*(1-r*r)^2, 0)) # quartic (biweight)
ker3 <- (function(r) max(1-abs(r),0)) # triangle
ker4 <- (function(r) ((2*pi)^(-0.5))*exp(-0.5*r*r)) # gaussian
ker5 <- (function(r) ifelse(abs(r)<=1, 0.5, 0)) # uniform
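Assuming the helpers above, a Parzen window classifier in the same style might look like this (a sketch; the repository's implementation may differ in details):

parzenClassifier <- function(trainSet, u, metric = dst1, kernel = ker1, h){
  rowsNum <- dim(trainSet)[1]; varsNum <- dim(trainSet)[2]-1
  labels <- levels(trainSet[rowsNum, varsNum+1])
  orderedDist <- sortByDist(trainSet, u, rowsNum, varsNum, metric)
  weights <- sapply(orderedDist[, 2] / h, kernel)      # K(rho(u, x_i) / h)
  votes <- sapply(labels, function(l) sum(weights[orderedDist[, 1] == l]))
  maxLabelIdx <- which.max(votes)
  return(c(labels[maxLabelIdx], max(votes) - max(votes[-maxLabelIdx])))
}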

The parameter h is called the "window width" and plays a role similar to the number of neighbors k in kNN. Here the optimal h is found using LOO (Epanechnikov kernel, sepal.width & sepal.length only):

[figure: LOOker1parzen2]

all four features:

[figure: LOOker1parzen4]

triangle kernel, h=0.4:

[figure: h04ker3parzen2]

gaussian kernel, h=0.1:

[figure: h01ker4parzen2]

The Parzen window algorithm can be modified to better suit case-based reasoning; this is what we call the Parzen window algorithm with variable window width. Let h be equal to the distance from u to its (k+1)-th nearest neighbor. Here is a comparison of the Parzen window classifier (uniform kernel) without and with the variable window width modification applied: [figures: parzenKer5, parzenKer5Var]
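Under the same assumptions, the variable-width variant changes only how h is chosen: it becomes the distance from u to its (k+1)-th nearest neighbor.

parzenVarClassifier <- function(trainSet, u, metric = dst1, kernel = ker5, k){
  rowsNum <- dim(trainSet)[1]; varsNum <- dim(trainSet)[2]-1
  labels <- levels(trainSet[rowsNum, varsNum+1])
  orderedDist <- sortByDist(trainSet, u, rowsNum, varsNum, metric)
  h <- orderedDist[k + 1, 2]                           # distance to the (k+1)-th neighbor
  weights <- sapply(orderedDist[, 2] / h, kernel)
  votes <- sapply(labels, function(l) sum(weights[orderedDist[, 1] == l]))
  maxLabelIdx <- which.max(votes)
  return(c(labels[maxLabelIdx], max(votes) - max(votes[-maxLabelIdx])))
}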

Potential function algorithm

It is just a slight modification of the Parzen window algorithm:

a(u) = arg max_{y ∈ Y} Σ_{i=1..l} [y_i = y] · γ_i · K(ρ(u, x_i) / h_i)

Now the window width h_i depends on the training object x_i, and γ_i represents the importance of each such object.
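The procedure below relies on a potentialClassifier function. Here is a sketch under the assumption that params is a matrix whose first column holds the widths h_i and whose second column holds the importances γ_i, matching the initialization in potentialParamsFinder; the repository's own implementation may differ.

potentialClassifier <- function(trainSet, u, params, metric = dst1, kernel = ker4){
  rowsNum <- dim(trainSet)[1]; varsNum <- dim(trainSet)[2]-1
  labels <- levels(trainSet[rowsNum, varsNum+1])
  u <- as.numeric(unlist(u))
  dists <- apply(trainSet[, 1:varsNum], 1, function(x) metric(as.numeric(x), u))
  weights <- params[, 2] * sapply(dists / params[, 1], kernel)   # gamma_i * K(rho(u, x_i) / h_i)
  votes <- sapply(labels, function(l) sum(weights[trainSet[, varsNum+1] == l]))
  maxLabelIdx <- which.max(votes)
  return(c(labels[maxLabelIdx], max(votes) - max(votes[-maxLabelIdx])))
}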

All these 2l parameters can be obtained and adjusted with the following procedure:

potentialParamsFinder <- function(dataSet){
  n = dim(dataSet)[1]; m = dim(dataSet)[2]-1
  params = cbind(rep(1,n), rep(0,n))
  for(i in 1:n){
    repeat{
      res = potentialClassifier(dataSet, dataSet[i, 1:m], params)[1]
      if(res == dataSet[i,m+1]) break
      params[i,2] = params[i,2] + 1
    }
  }
  return(params)
}
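A usage sketch on the two petal features (the row index 15 is just illustrative):

params <- potentialParamsFinder(iris[, 3:5])
potentialClassifier(iris[, 3:5], iris[15, 3:4], params)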

On a sufficiently large sample, though, this process can take a considerable amount of time. Here is the result with the found parameters applied:

[figure: potential]

STOLP algorithm

This algorithm compresses the data set by finding reference (etalon) objects, called stolps, and removing from the sample objects that do not affect classification or affect it harmfully. To explain how the algorithm works, the notion of margin has to be introduced.

The margin of an object x, M(x), shows how deeply the object lies within its own class:

M(x) = Γ_{y_x}(x) − max_{y ∈ Y, y ≠ y_x} Γ_y(x),

where Γ_y(x) is the total vote the classifier gives to class y for x. A large positive margin means x is a typical representative of its class, while a negative margin means x is misclassified.
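Since each classifier above returns its vote margin as the second element, a rough per-object margin estimate can be computed like this (a sketch using kNN with k = 6; the sign marks whether the object is classified correctly):

margins <- sapply(1:nrow(iris), function(i){
  res <- kNNClassifier(iris[-i, 3:5], iris[i, 3:4], k = 6)
  m <- as.double(res[2])
  if(res[1] == iris[i, 5]) m else -m     # negative margin for misclassified objects
})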

Here is the R implementation of the STOLP algorithm:

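# STOLP main params:
# set - data frame with rows representing objects, last column is the label factor
# threshold - maximum number of classification errors allowed on the full set
# classifier - base classifier returning c(label, margin)
# argsList - list of extra arguments passed to the classifier (e.g. k)
# a palette colors2, ordered by class, is assumed to be defined for plotting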
STOLP <- function(set, threshold, classifier, argsList){
  plot(iris[,3:4], bg=colors2[iris$Species], col=colors2[iris$Species])
  rowsNum <- dim(set)[1]
  varsNum <- dim(set)[2]-1
  toDelete = numeric()
  for(i in 1:rowsNum){
    currentArgs = c(list(set, set[i, 1:varsNum]),argsList)
    res = do.call(classifier,currentArgs)
    if(res[1] != set[i, varsNum+1]){
      toDelete <- c(toDelete, i)
    }
  }
  if(length(toDelete) > 0){
    points(set[toDelete,], pch=21, bg='grey', col='grey') # misclassified objects shown in grey
    set = set[-toDelete, ]; rowsNum = rowsNum - length(toDelete)
  }
  labels = levels(set[rowsNum,varsNum+1])
  maxRes = rep(0, length(labels)); names(maxRes)<-labels
  maxLabel = rep(0, length(labels)); names(maxLabel)<-labels
  for(i in 1:rowsNum){
    currentArgs = c(list(set, set[i, 1:varsNum]),argsList)
    res = do.call(classifier,currentArgs)
    if(as.double(res[2]) > maxRes[res[1]]){
      maxRes[res[1]] = as.double(res[2])
      maxLabel[res[1]] = i
    }
  }
  regular = set[maxLabel, ]
  points(regular, pch=21, bg=colors2[regular$Species], col=colors2[regular$Species])
  repeat{
    errCount = 0L; toAdd = 0L; maxAbsMargin = -1
    for(i in 1:rowsNum){
      currentArgs = c(list(regular, set[i, 1:varsNum]),argsList)
      res = do.call(classifier,currentArgs)
      if(res[1] != set[i, varsNum+1]){
        errCount = errCount + 1
        if(as.double(res[2]) > maxAbsMargin){
          toAdd <- i
          maxAbsMargin <- as.double(res[2])
        }
      }
    }
    if(errCount <= threshold)
      return(regular)
    newRegular = set[toAdd,]
    regular = rbind(regular,newRegular)
    points(newRegular, pch=21, bg=colors2[newRegular$Species], col=colors2[newRegular$Species])
  }
}
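A usage sketch (the palette and the error threshold of 5 are illustrative choices):

colors2 <- c("red", "green3", "blue")    # one colour per iris species, in level order
stolps <- STOLP(iris[, 3:5], 5, kNNClassifier, list(k = 6))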

Here are the STOLP compression results for the kNN, kWNN, and Parzen window algorithms, respectively:

[figures: stolpKnn, stolpKwnn, stolpParzen]

Conclusion

Here is a summary of the results obtained by LOO for the different algorithms on the iris data set:

| algorithm | best parameter | errors | empirical risk |
| --- | --- | --- | --- |
| kNN (4 features) | k = 19 | 3 | 0.02 |
| kNN (2 features) | k = 6 | 5 | 0.0(3) |
| Parzen window (Epanechnikov kernel, 4 features) | h = 0.8 | 5 | 0.0(3) |
| Parzen window (Gaussian kernel) | h = 0.02 | 5 | 0.0(3) |
| Parzen window w/ variable width (uniform kernel) | k = 1 | 5 | 0.0(3) |
| Parzen window w/ variable width (Gaussian kernel) | k = 1 | 6 | 0.04 |
| Parzen window (first 3 kernels, 2 features) | h = 0.4 | 6 | 0.04 |
| kWNN | q = 0.6 | 6 | 0.04 |