KNN classifier on BBC News Categories
This work builds a news classifier that assigns each article to one of 5 categories: business, entertainment, politics, sport and tech. We perform a k-nearest-neighbour (knn) predictive analysis with the class package, after preprocessing the text with the tm package. The classifier is built on 2225 BBC News articles from 2005-2006. The data can be found in the folder bbc-fulltext.
R Packages used:
- tm: Text-Mining Package
- plyr: Tools for Splitting, Applying and Combining Data
- class: Functions for Classification (knn)
Sources:
How to Build a Text Mining, Machine Learning Document Classification System in R!
Package ‘tm’
The caret Package
Package ‘class’
- Load libraries
libs <- c("tm", "plyr", "class", "e1071")
lapply(libs, require, character.only = TRUE)
- Set options - do not read strings as factors
options(stringsAsFactors = FALSE)
- Define 5 categories & specify the path of data
categories <- c("business", "entertainment",
"politics", "sport", "tech")
pathname <- "../bbc-fulltext"
Create cleanCorpus() to clean the text data:
- Remove punctuation
- Remove white space
- Convert all words to lowercase
- Remove stopwords - stopwords() returns the predefined list of stopwords for a given language
e.g. stopwords("english") includes the, is, at, which, and on
cleanCorpus <- function(corpus) {
corpus.tmp <- tm_map(corpus, removePunctuation)
corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower)) # wrap base tolower() so the results stay tm documents
corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
return(corpus.tmp)
}
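To see what these four steps do, here is a minimal sketch that applies them, in the same order as cleanCorpus(), to two made-up sentences. It assumes the tm package is installed; the sentences and the names docs, toy.corpus, toy.clean and cleaned are illustrative only and not part of the BBC data.

```r
library(tm)

# Two made-up sentences standing in for news articles (illustrative only).
docs <- c("The market IS up, and   shares rallied!",
          "Which team won the match?")
toy.corpus <- VCorpus(VectorSource(docs))

# Same four steps as cleanCorpus(), applied in the same order.
toy.clean <- tm_map(toy.corpus, removePunctuation)
toy.clean <- tm_map(toy.clean, stripWhitespace)
toy.clean <- tm_map(toy.clean, content_transformer(tolower))
toy.clean <- tm_map(toy.clean, removeWords, stopwords("english"))

# Pull the cleaned text back out as a plain character vector.
cleaned <- vapply(content(toy.clean), as.character, character(1))
print(cleaned)
```

Because stopwords are removed after the whitespace is stripped, the cleaned text can still contain runs of spaces where a stopword used to be. This should not matter for the term counts, since terms are split on whitespace.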
A Term-Document Matrix describes the frequency of terms in the collection of documents.
- Read in data from prespecified path
- Apply cleanCorpus() created from the previous step
- Apply TermDocumentMatrix() to create the resulting term-document matrices
- Sparsity is the proportion of documents in which a term does not occur.
e.g. a term occurring 0 times in 70 of 100 documents has a sparsity of 0.7
We remove terms whose sparsity reaches the 0.7 threshold.
i.e. the resulting matrix retains only terms occurring in more than roughly 30% of documents.
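The sparsity definition can be checked by hand on a small example. The sketch below uses base R only and a made-up count matrix (toy.counts, with three terms across ten documents); it mirrors the definition given here and is not how tm computes it internally.

```r
# Made-up term counts: rows are terms, columns are 10 documents.
toy.counts <- rbind(
  market = c(1, 2, 0, 1, 3, 0, 1, 1, 0, 2),  # absent from 3 of 10 documents
  goal   = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0),  # absent from 9 of 10 documents
  said   = c(1, 1, 1, 0, 1, 1, 0, 1, 1, 1)   # absent from 2 of 10 documents
)

# Sparsity = share of documents in which the term has a zero count.
term.sparsity <- rowMeans(toy.counts == 0)
print(term.sparsity)

# Keep only the terms below the 0.7 threshold.
kept <- names(term.sparsity)[term.sparsity < 0.7]
print(kept)
```

Here goal is dropped because it is missing from 90% of the documents, while market and said are kept.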
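Create an object tdm by applying generateTDM() to every category - note that one directory of articles is read in per category, so the sparsity threshold is applied within each category.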
generateTDM <- function(cate, path) {
s.dir <- sprintf("%s/%s", path, cate)
s.cor <- Corpus(DirSource(directory = s.dir))
s.cor.cl <- cleanCorpus(s.cor)
s.tdm <- TermDocumentMatrix(s.cor.cl)
s.tdm <- removeSparseTerms(s.tdm, 0.7) # setting sparsity threshold
return(list(name = cate, tdm = s.tdm))
}
tdm <- lapply(categories, generateTDM, path = pathname)
Create bindCategoryToTDM() and bind predefined categories to the existing TDM:
- Coerce TDM from a tdm object to matrix & transpose
- Coerce the transposed matrices into data frames
- Bind the resulting data frames with the name of each element of tdm
- Name the new column targetcategory
(Encouraged: print out the tdm produced above to see what it looks like!)
bindCategoryToTDM <- function(tdm) {
s.mat <- t(as.matrix(tdm[["tdm"]])) # documents become rows, terms become columns
s.df <- as.data.frame(s.mat, stringsAsFactors = FALSE)
s.df <- cbind(s.df, rep(tdm[["name"]], nrow(s.df)))
colnames(s.df)[ncol(s.df)] <- "targetcategory"
return(s.df)
}
cateTDM <- lapply(tdm, bindCategoryToTDM)
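The reshaping done by bindCategoryToTDM() is easier to follow on a tiny example. This sketch uses base R and a hypothetical 3-term by 2-document count matrix (toy.tdm) in place of a real tm term-document matrix; the term and file names are invented for illustration.

```r
# Hypothetical 3-term x 2-document count matrix for one category.
toy.tdm <- matrix(c(2, 0,
                    1, 1,
                    0, 3),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Terms = c("bank", "profit", "shares"),
                                  Docs  = c("001.txt", "002.txt")))

# Transpose so each row is a document, then convert to a data frame.
toy.df <- as.data.frame(t(toy.tdm), stringsAsFactors = FALSE)

# Append the category label to every row and name the new column.
toy.df <- cbind(toy.df, rep("business", nrow(toy.df)))
colnames(toy.df)[ncol(toy.df)] <- "targetcategory"
print(toy.df)
```

Each row now holds one document's term counts followed by its category, which is the layout the classifier needs.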
Create a final data frame for analysis.
- Bind the tdm data frames into one stack row by row.
- Assign 0 to missing values.
tdm.stack <- do.call(rbind.fill, cateTDM)
tdm.stack[is.na(tdm.stack)] <- 0
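Different categories keep different terms, so their data frames do not share the same columns. The sketch below, assuming the plyr package is installed, shows on two made-up data frames (business.df and sport.df) how rbind.fill() stacks them and why the missing values are then set to 0.

```r
library(plyr)

# Two made-up per-category data frames with different term columns.
business.df <- data.frame(bank = c(2, 1), profit = c(1, 0),
                          targetcategory = "business",
                          stringsAsFactors = FALSE)
sport.df <- data.frame(match = c(3, 1), goal = c(0, 2),
                       targetcategory = "sport",
                       stringsAsFactors = FALSE)

# Stack row by row; columns missing from one data frame are filled with NA.
toy.stack <- do.call(rbind.fill, list(business.df, sport.df))

# A term absent from a category simply has a count of 0 there.
toy.stack[is.na(toy.stack)] <- 0
print(toy.stack)
```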
Create random samples (indices) for the training and test sets, split 70:30.
train.idx <- sample(nrow(tdm.stack), ceiling(nrow(tdm.stack) * 0.7))
test.idx <- (1:nrow(tdm.stack))[-train.idx]
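The code above does not set a random seed, so the split, and therefore the results further down, will differ from run to run. As a small sketch of how the split behaves, the example below repeats it on row numbers only, using the 2225 articles quoted in the introduction. The seed value and the names n, train.idx.demo and test.idx.demo are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed, only so this demo split is repeatable
n <- 2225     # number of articles quoted in the introduction

# Draw 70% of the row numbers (rounded up) for training.
train.idx.demo <- sample(n, ceiling(n * 0.7))

# Every remaining row number goes to the test set.
test.idx.demo <- seq_len(n)[-train.idx.demo]

length(train.idx.demo)  # 1558
length(test.idx.demo)   # 667
```

The two index sets never overlap and together cover every row exactly once.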
We add two more preprocessing steps before feeding the data (news articles) into our model.
tdm.cate <- tdm.stack[, "targetcategory"]
tdm.stack.nl <- tdm.stack[, !colnames(tdm.stack) %in% "targetcategory"]
- tdm.cate stores the column of the target category (including both training and test sets.)
- tdm.stack.nl stores the rest of the columns.
A quick breakdown of !colnames(tdm.stack) %in% "targetcategory": colnames(tdm.stack) returns all the column names of tdm.stack. %in% is the matching operator, so colnames(tdm.stack) %in% "targetcategory" returns TRUE for the column named "targetcategory" and FALSE for every other column. We want the columns that are NOT "targetcategory", so we invert the result with the ! operator.
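The same expression can be tried step by step on a small data frame. This is a base R sketch; toy.df, is.target and features are made-up names for illustration.

```r
# Made-up data frame with two term columns and the label column.
toy.df <- data.frame(bank = c(2, 0), goal = c(0, 3),
                     targetcategory = c("business", "sport"),
                     stringsAsFactors = FALSE)

# TRUE only for the column named "targetcategory".
is.target <- colnames(toy.df) %in% "targetcategory"
print(is.target)  # FALSE FALSE TRUE

# Negating the result selects every column except the label.
features <- toy.df[, !is.target]
print(colnames(features))  # "bank" "goal"
```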
Both objects are used in the next part:
With a K-Nearest-Neighbour (knn) model, each data point (the news article we want to assign to a category) is compared with its k nearest neighbouring data points in the training set. The category receiving the most votes from those k neighbours is the winning class and is assigned to the news article. The knn() function requires 3 main arguments.
- train = matrix (or data frame) of the training set cases, without the category column.
- test = matrix (or data frame) of the test set cases.
- cl = true classifications, or the true category, of the training set cases.
And a few optional arguments.
- k = the number of neighbours considered. (default k = 1)
- l = minimum number of votes needed for a definite decision; otherwise the prediction is a doubt, returned as NA. (default l = 0)
- prob = if set TRUE, the proportion of votes (for the winning class) will be returned. (default prob = FALSE)
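Before running knn() on the news data, it may help to see the arguments in action on a toy problem. The sketch below uses made-up two-dimensional points in two well-separated groups (labelled "a" and "b"); the names toy.train, toy.labels, toy.test and toy.pred are illustrative only.

```r
library(class)

# Six made-up training points: three near the origin, three near (5, 5).
toy.train <- matrix(c(0, 0,
                      0, 1,
                      1, 0,
                      5, 5,
                      5, 6,
                      6, 5),
                    ncol = 2, byrow = TRUE)
toy.labels <- factor(c("a", "a", "a", "b", "b", "b"))

# Two unlabelled points to classify, one close to each group.
toy.test <- matrix(c(0.5, 0.5,
                     5.5, 5.5),
                   ncol = 2, byrow = TRUE)

# Vote among the 3 nearest neighbours and keep the winning vote share.
toy.pred <- knn(train = toy.train, test = toy.test,
                cl = toy.labels, k = 3, prob = TRUE)
print(toy.pred)
```

With prob = TRUE, the proportion of votes for the winning class is attached to the result and can be read with attr(toy.pred, "prob").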
Now we feed the required data into the model, then compare the predictions with the actual categories.
knn.pred <- knn(train = tdm.stack.nl[train.idx, ],
                test = tdm.stack.nl[test.idx, ],
                cl = tdm.cate[train.idx],
                k = 1)
knn.mat <- table("Predictions" = knn.pred, "Actual" = tdm.cate[test.idx])
knn.mat
##                Actual
## Predictions     business entertainment politics sport tech
##   business           153            12       12    10    6
##   entertainment        7           102        3     7    5
##   politics             2             0      101     0    9
##   sport                2             5        6   131    3
##   tech                 0             0        0     0   91
It looks pretty good so far - the main diagonal gives the number of correct predictions for each category.
knn.acc <- sum(diag(knn.mat))/sum(knn.mat)
knn.acc
## [1] 0.8665667
This means about 87% of the test articles were assigned to the correct category.
Seems like we've done a pretty good prediction!
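Overall accuracy hides differences between categories. As a rough follow-up, the sketch below re-enters the confusion matrix printed above by hand (as knn.mat.demo, a name chosen for illustration) and computes two per-category figures that are not part of the original analysis: the share of each actual category that was found, and the share of each predicted category that was right.

```r
cats <- c("business", "entertainment", "politics", "sport", "tech")

# Confusion matrix copied from the output above (rows = predictions).
knn.mat.demo <- matrix(c(153,  12,  12,  10,  6,
                           7, 102,   3,   7,  5,
                           2,   0, 101,   0,  9,
                           2,   5,   6, 131,  3,
                           0,   0,   0,   0, 91),
                       nrow = 5, byrow = TRUE,
                       dimnames = list(Predictions = cats, Actual = cats))

# Overall accuracy: correct predictions over all test articles.
accuracy <- sum(diag(knn.mat.demo)) / sum(knn.mat.demo)

# Share of each actual category that was predicted correctly (columns).
found <- diag(knn.mat.demo) / colSums(knn.mat.demo)

# Share of each predicted category that was actually correct (rows).
right <- diag(knn.mat.demo) / rowSums(knn.mat.demo)

print(round(accuracy, 4))
print(round(found, 3))
print(round(right, 3))
```

In this particular run every article predicted as tech really was tech, yet 23 of the 114 actual tech articles were assigned to other categories.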