markvanderloo/stringdist

Parallelization documentation

leonardosnr opened this issue · 3 comments

Dear,

The package is great! Fast and easy to use.

I have a challenge for which I believe parallelization is the best option. Could you share some documentation on the best way to parallelize stringdist?

I have two "big" datasets of names (700k names each): A and B. I would like to search for every name from A in B and, for each row in A, select the best candidates (e.g., Levenshtein distance < 5). Computing the full cross product of both datasets does not seem like the best option computationally. I decided to use lapply (via the future.apply package). I don't know if this is the best option, and I am struggling to find documentation that covers problems as big as the one I described. Could you help?

I would use something like the code below, but:

  1. first measure run-times with smaller datasets, as this might take some time
  2. you might want to use a different distance measure; Jaro(-Winkler), for example, is much faster than Levenshtein. On my machine the difference is approximately a factor of 3.
# Generate some data
n1 <- 1E2
n2 <- 1E4
names1 <- apply(array(sample(letters, 10*n1, replace = TRUE), dim = c(n1, 10)),
  1, paste0, collapse = "" )
names2 <- apply(array(sample(letters, 10*n2, replace = TRUE), dim = c(n2, 10)),
  1, paste0, collapse = "" )


library(stringdist)
library(parallel)

# I find it easier to work with data.frames
names1 <- data.frame(index = seq_along(names1), names = names1)
names2 <- data.frame(index = seq_along(names2), names = names2)

# Start cluster
# Under Linux a fork cluster is more efficient
nworkers <- 4
cl <- makeCluster(nworkers, type = "FORK")

# Under Windows you have to use a PSOCK cluster and load the packages on the workers:
# cl <- makeCluster(nworkers, type = "PSOCK")
# clusterEvalQ(cl, library(stringdist))
# clusterEvalQ(cl, library(data.table))

# Split the first dataset into chunks that will be processed parallel and 
# serially (for memory)
# Here I split names1 into chunks with max 1E4 rows; larger chunks are slightly
# faster but use more memory
nchunks <- max(floor(nrow(names1)/1E4)+1, nworkers)
n1 <- split(names1, seq_len(nrow(names1)) %% nchunks)

res <- parLapply(cl, n1, function(names1, names2, threshold = 1) {
  dta <- expand.grid(x = seq_len(nrow(names1)), y = seq_len(nrow(names2)))
  dta$dist <- stringdist(names1$names[dta$x], names2$names[dta$y])
  dta <- dta[dta$dist <= threshold, ]
  data.frame(index1 = names1$index[dta$x], index2 = names2$index[dta$y], 
    dist = dta$dist)
}, names2 = names2, threshold = 15)

res <- do.call(rbind, res)
res$name1 <- names1$names[res$index1]
res$name2 <- names2$names[res$index2]

Maybe it is good to mention that stringdist is multithreaded, and it scales close to linearly with the number of threads. Typically, if your machine has n cores, stringdist will use n-1 of them.

So there is no point in using parallel::makeCluster() on a single machine. If your cluster includes multiple physical machines (or virtual instances) then that would help.
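The number of threads can also be capped per call via the nthread argument; a minimal sketch (the value 2 is an arbitrary choice):

library(stringdist)

x <- c("foo", "bar", "baz")
y <- c("boo", "bat", "qux")

# default: uses the number of threads stringdist chose at load time
stringdist(x, y, method = "lv")

# cap the number of threads for this call only
stringdist(x, y, method = "lv", nthread = 2)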

The suggestion of @djvanderlaan to use a faster distance measure, like JW or cosine distance, is also a good one. If your data is ASCII, or you are OK with being a little less accurate, you can also gain a bit of speed by setting useBytes=TRUE.
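A quick way to check the effect on your own machine; a minimal sketch (the sizes are arbitrary and system.time() only gives a rough indication):

library(stringdist)

set.seed(1)
a <- replicate(1e3, paste0(sample(letters, 10, replace = TRUE), collapse = ""))
b <- replicate(1e3, paste0(sample(letters, 10, replace = TRUE), collapse = ""))

system.time(stringdistmatrix(a, b, method = "lv"))
system.time(stringdistmatrix(a, b, method = "jw"))
system.time(stringdistmatrix(a, b, method = "jw", useBytes = TRUE))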

Very cool!

Many thanks for sharing it with me. It is very helpful.

@djvanderlaan, I adopted a different approach. I wanted to avoid the cross join, which sometimes can take a lot of time. I am trying to compare your approach with mine; mine seems to be faster, especially when the datasets are very large (in my case 700k vs 700k). I combined it (not shown below) with future.apply, and I believe I gained some time. I don't understand very well why; probably it is because I skip the cross (or join).

Here is my code:

library(stringdist)

# step 0: create / open the data
# Generate some data
n1 <- 1E2
n2 <- 1E4
names1 <- apply(array(sample(letters, 10*n1, replace = TRUE), dim = c(n1, 10)),
                1, paste0, collapse = "" )
names2 <- apply(array(sample(letters, 10*n2, replace = TRUE), dim = c(n2, 10)),
                1, paste0, collapse = "" )

# I find it easier to work with data.frames
names1 <- data.frame(index = seq_along(names1), names = names1)
names2 <- data.frame(index = seq_along(names2), names = names2)

# step 1: select all the names and ids from the source 
name.b <- names2$names
id.b <- names2$index

# step 2: select the rows of the target dataset you want to process
sample.test <- seq_len(nrow(names1))

start.time.m <- Sys.time()

# step 3: for each row in the target dataset we will search for useful candidates in the source
db.rec <- lapply(sample.test, function(i){
  # select the name and id we will search for
  name.a <- names1$names[[i]]
  id.a <- names1$index[[i]]
  
  # get the lv measure for each one
  lv <- stringdist::stringdist(name.a, name.b, method = 'lv')
  
  # keep only candidates with a Levenshtein distance below the threshold (15 here)
  id.a.r <- id.b[lv < 15]
  
  # prepare info for matrix
  if(length(id.a.r) == 0){
    size.v <- 1
    id.a.r <- NA
  } else {
    size.v <- length(id.a.r)
  }
  
  # get matrix
  m.r <- matrix(c(rep(id.a, size.v), id.a.r), ncol = 2)
  return(m.r)
})

# step 4: put everything in a data frame
db.rec <- do.call(rbind, db.rec)
db.rec <- as.data.frame(db.rec)
total.time.m <- Sys.time() - start.time.m
total.time.m
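For completeness, swapping the lapply() for future.apply::future_lapply() looks roughly like this (a sketch rather than my exact code; workers = 4 is an arbitrary choice, and on Linux plan(multicore) may be used instead of plan(multisession)):

library(future.apply)
plan(multisession, workers = 4)  # workers = 4 is arbitrary

# same body as the lapply() above, just run across the workers
db.rec <- future_lapply(sample.test, function(i){
  name.a <- names1$names[[i]]
  id.a   <- names1$index[[i]]
  lv     <- stringdist::stringdist(name.a, name.b, method = 'lv')
  id.a.r <- id.b[lv < 15]
  if(length(id.a.r) == 0) id.a.r <- NA
  matrix(c(rep(id.a, length(id.a.r)), id.a.r), ncol = 2)
})
db.rec <- as.data.frame(do.call(rbind, db.rec))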