markvanderloo/stringdist

stringsimmatrix warning with odd length vectors when using method = 'dl'

Hey folks,

stringsimmatrix() seems to be doing something wrong when one of the input vectors has odd length and method = 'dl'.

library(dplyr)
library(stringdist)

n = 7
method = 'dl'

source_accounts <- c('aaa','bba','aba','eee','eef','ege','egegeg','gegegegeg','gagagagadg') %>% as.data.frame() %>% rename(., Account_lower = .)

target_accounts <- c('aaa','aba','dda') %>% as.data.frame() %>% rename(., name = .)

source_vector <- source_accounts %>%
 select(Account_lower) %>% 
 distinct() %>%
 head(n) %>%
 pull()

print(length(source_vector))

target_vector <- target_accounts %>%
 select(name) %>%
 head(2) %>%
 pull() %>%
 tolower()

print(length(target_vector))

sims <- stringsimmatrix(target_vector, source_vector, method = method)

The above code returns a warning, and the resulting similarities are not correct. Compare with a run where n = 6; results for both runs are below.

n = 7

Warning message:
In pmax(lengths(a, type = nctype), lengths(b, type = nctype)) :
  an argument will be fractionally recycled
> sims
          [,1]      [,2]      [,3] [,4] [,5] [,6] [,7]
[1,] 1.0000000 0.3333333 0.6666667  0.5    0    0    0
[2,] 0.6666667 0.6666667 1.0000000  0.0    0    0    0

n = 6

> sims <- stringsimmatrix(target_vector, source_vector, method = method)
> sims
          [,1]      [,2]      [,3] [,4] [,5] [,6]
[1,] 1.0000000 0.3333333 0.6666667    0    0    0
[2,] 0.6666667 0.6666667 1.0000000    0    0    0

Notice the difference in row 1, column 4: 0.5 when n = 7, but 0 when n = 6.
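
For anyone trying to narrow this down, here is a minimal base R sketch of how the warning text itself can arise. It does not reproduce stringdist's internals; it only assumes that two vectors of string lengths, one of length 2 and one of length 7, end up in the same pmax() call, as the warning suggests. The helper pmax_warning is hypothetical and exists only for this illustration.

```r
# Character counts for the two input vectors, using the strings from the example above.
target_len <- nchar(c("aaa", "aba"))                                        # length 2
source_len <- nchar(c("aaa", "bba", "aba", "eee", "eef", "ege", "egegeg"))  # length 7

# Hypothetical helper: return the warning message pmax() raises, or NA if there is none.
pmax_warning <- function(x, y) {
  tryCatch({
    pmax(x, y)
    NA_character_
  }, warning = function(w) conditionMessage(w))
}

w7 <- pmax_warning(target_len, source_len)       # 7 is not a multiple of 2
w6 <- pmax_warning(target_len, source_len[1:6])  # 6 is a multiple of 2

print(w7)
print(w6)
```

If this assumption holds, it would explain why the n = 6 run is silent: the shorter vector recycles evenly into the longer one, so pmax() has nothing to warn about.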

This could be related to #88, which has been fixed in the development version.

Simplifying and repeating your example with the current development version, I get the following.

> target_vector <- c("aaa","aba")
> source_vector <- c("aaa","bba","aba","eee","eef", "ege", "egegeg")
> stringsimmatrix(target_vector, source_vector, method="dl")
          [,1]      [,2]      [,3] [,4] [,5] [,6] [,7]
[1,] 1.0000000 0.3333333 0.6666667    0    0    0    0
[2,] 0.6666667 0.6666667 1.0000000    0    0    0    0
> packageVersion("stringdist")
[1] ‘0.9.6.3.5’

I am about to release a new version that should solve the problem.