stringsimmatrix warning with odd length vectors when using method = 'dl'
Closed this issue · 1 comments
mbconnor commented
Hey folks,
stringsimmatrix() seems to return wrong results when one of the input vectors has odd length and method = 'dl'.
library(dplyr)
library(stringdist)
n = 7
source_accounts <- c('aaa','bba','aba','eee','eef','ege','egegeg','gegegegeg','gagagagadg') %>% as.data.frame() %>% rename(., Account_lower = .)
target_accounts <- c('aaa','aba','dda') %>% as.data.frame() %>% rename(., name = .)
# First n distinct source strings
source_vector <- source_accounts %>%
  select(Account_lower) %>%
  distinct() %>%
  head(n) %>%
  pull()
print(length(source_vector))
# First two target strings, lower-cased
target_vector = tolower(target_accounts %>%
  select(name) %>%
  head(2) %>%
  pull())
print(length(target_vector))
sims <- stringsimmatrix(target_vector, source_vector, method = 'dl')
The above code produces a warning, and the resulting similarities are not correct. Compare with a run where n = 6.
Results below
n = 7
Warning message:
In pmax(lengths(a, type = nctype), lengths(b, type = nctype)) :
an argument will be fractionally recycled
> sims
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1.0000000 0.3333333 0.6666667 0.5 0 0 0
[2,] 0.6666667 0.6666667 1.0000000 0.0 0 0 0
n = 6
> sims <- stringsimmatrix(target_vector, source_vector, method = 'dl')
> sims
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1.0000000 0.3333333 0.6666667 0 0 0
[2,] 0.6666667 0.6666667 1.0000000 0 0 0
Notice the difference in row 1, column 4: 0.5 with n = 7 versus 0 with n = 6.
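For anyone else who hits this, here is a minimal sketch of how that 0.5 could arise. It assumes, based only on the pmax() call quoted in the warning and not on the package source, that the distances are normalised by a recycled vector of maximum string lengths. Base R's adist() (plain Levenshtein) stands in for method = 'dl', and all variable names are made up for illustration.

```r
# Inputs from the report: 2 target strings, 7 source strings.
a <- c("aaa", "aba")
b <- c("aaa", "bba", "aba", "eee", "eef", "ege", "egegeg")

# Stand-in distance matrix (2 x 7). adist() is Levenshtein, not
# Damerau-Levenshtein; it is used here only to keep the sketch base R.
d <- adist(a, b)

# Element-wise pmax recycles the shorter argument. 7 is not a multiple
# of 2, hence "an argument will be fractionally recycled".
recycled <- suppressWarnings(pmax(nchar(a), nchar(b)))  # length 7, not 2 x 7

# Dividing the matrix by that vector recycles it in column-major order,
# so entry [1, 4] (the 7th element) is divided by nchar("egegeg") = 6.
wrong <- 1 - d / recycled

# Pairwise maximum lengths: one value per (a[i], b[j]) pair.
pairwise <- outer(nchar(a), nchar(b), pmax)
correct <- 1 - d / pairwise

print(wrong[1, 4])
print(correct[1, 4])
```

With six source strings the lengths divide evenly, which would explain why that run gives no warning. The sketch only aims to reproduce the one suspicious entry; it is not a reimplementation of stringsimmatrix(), and other entries of `wrong` need not match the package output.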
markvanderloo commented
This could be related to #88, which has been fixed in the development version.
Simplifying and repeating your example with the current development version, I get the following.
> target_vector <- c("aaa","aba")
> source_vector <- c("aaa","bba","aba","eee","eef", "ege", "egegeg")
> stringsimmatrix(target_vector, source_vector, method="dl")
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1.0000000 0.3333333 0.6666667 0 0 0 0
[2,] 0.6666667 0.6666667 1.0000000 0 0 0 0
> packageVersion("stringdist")
[1] ‘0.9.6.3.5’
I am about to release a new version that should solve the problem.
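To check whether an installed version still shows the behaviour, something along these lines may help. It is a rough sketch: `shows_recycling_issue()` is a hypothetical helper name, not part of stringdist, and it assumes the installed package exports stringsimmatrix().

```r
# Hypothetical helper: TRUE if the installed stringdist still shows the
# reported behaviour, FALSE if not, NA if the package is not installed.
shows_recycling_issue <- function() {
  if (!requireNamespace("stringdist", quietly = TRUE)) return(NA)
  a <- c("aaa", "aba")
  b <- c("aaa", "bba", "aba", "eee", "eef", "ege", "egegeg")
  warned <- FALSE
  sims <- withCallingHandlers(
    stringdist::stringsimmatrix(a, b, method = "dl"),
    warning = function(w) {
      warned <<- TRUE                 # remember that a warning was raised
      invokeRestart("muffleWarning")  # and keep it off the console
    }
  )
  # "aaa" and "eee" share no characters, so their similarity should be 0.
  warned || abs(sims[1, 4]) > 1e-8
}

print(shows_recycling_issue())
```

The helper reuses the simplified example from this thread, so it only covers this one input shape (lengths 2 and 7).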