Parallelization documentation
leonardosnr opened this issue · 3 comments
Dear,
The package is great! Fast and easy to use.
I have a challenge for which I believe parallelization is the best option. Could you share some documentation on the best way to parallelize stringdist?
I have two "big" datasets (700k names each): A and B. I would like to search for every name from A in B and, for each row in A, select the best candidates in B (e.g., Levenshtein distance < 5). Crossing both datasets does not seem the best option computationally. I decided to use lapply (via the future.apply package). I don't know if this is the best option, and I am struggling to find documentation that covers problems as big as the one I described. Could you help?
I would use something like the code below, but:
- first measure run-times with smaller datasets, as this might take some time
- you might want to use a different distance measure; Jaro(-Winkler), for example, is much faster than Levenshtein. On my machine the difference is approximately a factor of 3; a quick timing check is sketched below.
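For example, a quick way to compare the two measures yourself (timings will differ per machine):
library(stringdist)
x <- replicate(1E3, paste0(sample(letters, 10, replace = TRUE), collapse = ""))
y <- replicate(1E3, paste0(sample(letters, 10, replace = TRUE), collapse = ""))
system.time(stringdistmatrix(x, y, method = "lv"))
system.time(stringdistmatrix(x, y, method = "jw"))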
# Generate some data
n1 <- 1E2
n2 <- 1E4
names1 <- apply(array(sample(letters, 10*n1, replace = TRUE), dim = c(n1, 10)),
                1, paste0, collapse = "")
names2 <- apply(array(sample(letters, 10*n2, replace = TRUE), dim = c(n2, 10)),
                1, paste0, collapse = "")
library(stringdist)
library(parallel)
# I find it easier to work with data.frames
names1 <- data.frame(index = seq_along(names1), names = names1)
names2 <- data.frame(index = seq_along(names2), names = names2)
# Start cluster
# Under Linux a fork cluster is more efficient
nworkers <- 4
cl <- makeCluster(nworkers, type = "FORK")
# Under Windows you have to use a PSOCK cluster and load the needed packages on the workers:
# cl <- makeCluster(nworkers, type = "PSOCK")
# clusterEvalQ(cl, library(stringdist))
# clusterEvalQ(cl, library(data.table))
# Split the first dataset into chunks; chunks are distributed over the workers
# and processed one after another on each worker (this limits memory use)
# Here I split names1 into chunks with at most 1E4 rows; larger chunks are slightly
# faster but use more memory
nchunks <- max(floor(nrow(names1)/1E4)+1, nworkers)
n1 <- split(names1, seq_len(nrow(names1)) %% nchunks)
res <- parLapply(cl, n1, function(names1, names2, threshold = 1) {
  dta <- expand.grid(x = seq_len(nrow(names1)), y = seq_len(nrow(names2)))
  dta$dist <- stringdist(names1$names[dta$x], names2$names[dta$y])
  dta <- dta[dta$dist <= threshold, ]
  data.frame(index1 = names1$index[dta$x], index2 = names2$index[dta$y],
             dist = dta$dist)
}, names2 = names2, threshold = 15)
res <- do.call(rbind, res)
res$name1 <- names1$names[res$index1]
res$name2 <- names2$names[res$index2]
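And when you are done, clean up the cluster:
# Stop the cluster when done
stopCluster(cl)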
Maybe it is good to mention that stringdist is multithreaded and scales close to linearly with the number of threads. Typically, if your machine has n cores, stringdist will use n-1 of them. So there is no point in using parallel::makeCluster() on a single machine. If your cluster includes multiple physical machines (or virtual instances), then that would help.
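If you want to set the number of threads explicitly, the distance functions accept an nthread argument. A minimal example:
library(stringdist)
# force stringdist to use 4 threads for this call
stringdist("foo", c("boo", "bar", "baz"), method = "lv", nthread = 4)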
The suggestion of @djvanderlaan to use a faster distance measure, like JW or cosine distance, is also a good one. If your data is ASCII, or you are OK with being a little less accurate, you can also gain a bit of speed by setting useBytes=TRUE.
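For example (a small sketch):
library(stringdist)
# Jaro-Winkler distance, compared byte-wise (only safe for ASCII input)
stringdist("leonardo", c("leonard", "lenardo", "bernardo"),
           method = "jw", useBytes = TRUE)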
Very cool!
Many thanks for sharing it with me. It is very helpful.
@djvanderlaan, I adopted a different approach. I wanted to avoid the cross join, which can sometimes take a lot of time. I am trying to compare your approach with mine; mine seems like it might be faster, especially when the dataset is very large (in my case 700k vs 700k). I combined it with future.apply (not shown in the block below, but sketched at the end of this comment) and I believe I gained some time. I don't understand very well why; probably it is because I skip the cross (or join).
Here is my code:
library(stringdist)
# step 0: create / open the data
# Generate some data
n1 <- 1E2
n2 <- 1E4
names1 <- apply(array(sample(letters, 10*n1, replace = TRUE), dim = c(n1, 10)),
                1, paste0, collapse = "")
names2 <- apply(array(sample(letters, 10*n2, replace = TRUE), dim = c(n2, 10)),
                1, paste0, collapse = "")
# I find it easier to work with data.frames
names1 <- data.frame(index = seq_along(names1), names = names1)
names2 <- data.frame(index = seq_along(names2), names = names2)
# step 1: select all the names and ids from the source
name.b <- names2$names
id.b <- names2$index
# step 2: select the rows of the target dataset you want to process
sample.test <- seq_len(nrow(names1))
start.time.m <- Sys.time()
# step 3: for each row in the target dataset we will search for useful candidates in the source
db.rec <- lapply(sample.test, function(i){
  # select the name and id we will search for
  name.a <- names1$names[[i]]
  id.a <- names1$index[[i]]
  # get the lv measure against every name in the source
  lv <- stringdist::stringdist(name.a, name.b, method = 'lv')
  # keep only candidates with lv below the threshold (15 in this example)
  id.a.r <- id.b[lv < 15]
  # prepare info for the matrix; use NA when there is no candidate
  if(length(id.a.r) == 0){
    size.v <- 1
    id.a.r <- NA
  } else {
    size.v <- length(id.a.r)
  }
  # build a two-column matrix: id from the target, id of the candidate in the source
  m.r <- matrix(c(rep(id.a, size.v), id.a.r), ncol = 2)
  return(m.r)
})
# step 4: put everything in a data frame
db.rec <- do.call(rbind, db.rec)
db.rec <- as.data.frame(db.rec)
total.time.m <- Sys.time() - start.time.m
total.time.m
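And the future.apply version I mentioned is basically the same loop with lapply swapped for future_lapply (a sketch, not the exact code I ran):
library(future)
library(future.apply)
plan(multisession, workers = 4)  # or plan(multicore) on Linux

db.rec <- future_lapply(sample.test, function(i){
  name.a <- names1$names[[i]]
  id.a <- names1$index[[i]]
  lv <- stringdist::stringdist(name.a, name.b, method = 'lv')
  id.a.r <- id.b[lv < 15]
  if(length(id.a.r) == 0) id.a.r <- NA
  matrix(c(rep(id.a, length(id.a.r)), id.a.r), ncol = 2)
})
db.rec <- as.data.frame(do.call(rbind, db.rec))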