match.matrix(p.188)
p.188(last sentence).
The textbook says "You do not need to specify the third parameter, original.matrix, since train.dtm is the original matrix." But when I try the following code:
train.dtm <- match.matrix(clean.train, weighting = tm::weightTfIdf)
I got the following error message:
Error in match.matrix(clean.train, weighting = tm::weightTfIdf) : object 'original.martix' not found
In addition: Warning message:
In weighting(x) : empty document(s): 409
Would you please try the code and advise me how to fix it? I attached the code for match.matrix as follows:
#------------ End of an issue
Thanks in advance
Chang-Kyo Suh(ck@knu.ac.kr)
Looks like some errata on my part. Thx for sharing.
Here is a cleaner version that functions. RCurl is only needed to get the data directly from GitHub; otherwise you won't need it.
# Libs
require(RCurl)
require(tm)
Ensure the data is considered strings
# Options
options(stringsAsFactors = FALSE)
Fetch the data from GitHub, or just use read.csv if you have it locally.
# Data
headlines<-read.csv(text=getURL("https://raw.githubusercontent.com/kwartler/text_mining/master/all_3k_headlines.csv"))
Here is a generic cleaning function, you can adjust as needed.
# Custom Function
headline.clean<-function(x){
  x<-tolower(x)
  x<-removeWords(x,stopwords('en'))
  x<-removePunctuation(x)
  x<-stripWhitespace(x)
  return(x)
}
When importing from raw git, sometimes there are special characters. You could add this function to the cleaning function if needed instead but I kept it separate for this simple demonstration.
# Some special characters can cause issues (could be part of the clean function)
headlines$headline<-gsub("[^[:graph:]]", " ",headlines$headline)
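To see what that pattern does, here is a small base R sketch on a made-up string (the sample text is an assumption, not taken from the headlines file). The class `[^[:graph:]]` matches anything that is not a visible character, so tabs, newlines and ordinary spaces each become a single space. How non-ASCII letters are treated can depend on your locale, so it is worth spot-checking a few rows of your own data.

```r
# Hypothetical sample headline with an embedded tab and newline
raw_headline <- "breaking\tnews\nnow"

# Replace every non-visible character with a space
clean_headline <- gsub("[^[:graph:]]", " ", raw_headline)

print(clean_headline)
```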
Apply the function to clean all the data. In a real world scenario you could do this to any new text data coming in but here I apply it to the entire corpus prior to partitioning.
# Apply cleaning function
clean.train<-headline.clean(headlines$headline)
I used sample but you can use any partitioning scheme, e.g. from caret like in the book.
# Quick Partitioning
train<-sample(1:nrow(headlines),2500, replace=FALSE)
train.headlines<-clean.train[train]
test.headlines<-clean.train[-train]
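If you want the split to be repeatable, one option is to set a seed first. The sketch below uses dummy strings in place of the cleaned headlines, and the seed value and variable names are arbitrary assumptions for illustration only.

```r
set.seed(1234)  # arbitrary seed, only so the split can be reproduced

# Stand-in for the cleaned headlines: 3000 dummy strings
dummy.text <- paste("headline", seq_len(3000))

# Draw 2500 indices without replacement for training
train.idx <- sample(seq_along(dummy.text), 2500, replace = FALSE)

train.part <- dummy.text[train.idx]
test.part  <- dummy.text[-train.idx]

length(train.part)                        # 2500
length(test.part)                         # 500
length(intersect(train.part, test.part))  # 0, the two sets do not overlap
```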
Remember to change your source; you can use getSources() to see a list of available sources. Also, in the new tm package you have to remember that readTabular was deprecated in favor of the data frame source.
# Make a VCorpus
train.corp<-VCorpus(VectorSource(train.headlines))
# Construct a DTM
train.dtm<-DocumentTermMatrix(train.corp)
Here is the revised matchMatrix() function. This time I made it more straightforward and it MUST have an original DTM to work from. The wgt parameter is a string for the DTM term weight. It defaults to term frequency but needs to match the original. The function accepts a vector of new text, the original DTM and the weight inputs.
matchMatrix<-function(textVec,originalDTM,wgt='weightTf'){
  # One last cleaning to make sure it works
  newTxt <- sapply(as.vector(textVec, mode = "character"),
                   iconv, to = "UTF-8", sub = "byte")
  # Make a Test Set Corpus
  newCorpus <- VCorpus(VectorSource(newTxt))
  # Make the Test Set DTM; control needs a named weighting function, not a bare string
  ctrl<-list(weighting = match.fun(wgt))
  mat <- DocumentTermMatrix(newCorpus, control = ctrl)
  # Find terms in the original DTM that the new text lacks
  emptyTerms<-setdiff(colnames(originalDTM) ,colnames(mat))
  # Check wgt (tm labels tf-idf matrices "tf-idf"; "tfidf" kept as a fallback)
  if (attr(originalDTM, "weighting")[2] %in% c("tf-idf", "tfidf")){
    weight <- 0.000000001
  } else {
    weight <- 0
  }
  # Construct empty cols
  emptyMat <- matrix(weight, nrow = nrow(mat), ncol = length(emptyTerms))
  # Add names
  colnames(emptyMat) <- emptyTerms
  rownames(emptyMat) <- rownames(mat)
  # Find common terms
  commonTerms<-colnames(mat)[colnames(mat) %in% colnames(originalDTM)]
  # Append the original data
  joinDTM<-cbind(emptyMat,mat[,commonTerms])
  joinDTM<-as.DocumentTermMatrix(joinDTM,weighting = wgt)
  # Re-order to match the original DTM's column order
  joinDTM<-joinDTM[,colnames(originalDTM)]
  # Response
  return(joinDTM)
}
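The alignment idea inside the function can be seen without tm at all. This is only a toy sketch on plain base R matrices, with made-up terms and counts (all names here are hypothetical), not the function itself: pad the new matrix with zero columns for terms it lacks, drop terms the original never saw, then put the columns in the original order.

```r
# Toy "training" matrix: 2 documents, vocabulary apple/banana/cherry
original <- matrix(c(1, 0, 2,
                     0, 1, 1),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("d1", "d2"), c("apple", "banana", "cherry")))

# Toy "new" matrix: has an unseen term (durian) and lacks banana and cherry
new.mat <- matrix(c(3, 1,
                    0, 2),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("n1", "n2"), c("apple", "durian")))

# Terms the original knows about but the new text never used
missing.terms <- setdiff(colnames(original), colnames(new.mat))

# Zero-filled columns for those terms
filler <- matrix(0, nrow = nrow(new.mat), ncol = length(missing.terms),
                 dimnames = list(rownames(new.mat), missing.terms))

# Keep only the shared terms, which drops the unseen "durian"
shared.terms <- intersect(colnames(new.mat), colnames(original))

# Bind the pieces and put the columns in the original order
aligned <- cbind(filler, new.mat[, shared.terms, drop = FALSE])
aligned <- aligned[, colnames(original), drop = FALSE]

identical(colnames(aligned), colnames(original))  # TRUE
```

A model trained on the original matrix can only use the columns it was trained on, which is why the unseen term is dropped rather than kept.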
Here you apply the function with the needed info.
testDTM<-matchMatrix(textVec=test.headlines, originalDTM = train.dtm,wgt='weightTf')
You can check the column names and dimensions below.
# Check
head(train.dtm$dimnames$Terms)
head(testDTM$dimnames$Terms)
dim(train.dtm)
dim(testDTM)
In this example, the new DTM has 500 documents and the original has 2500. Both should have the same number of terms and identical column names. In this way you can prepare a new set of text for modeling and analysis.