trinker/textclean

Unexpected behaviour, replace_word_elongation returning NAs

sdesabbata opened this issue · 4 comments

The replace_word_elongation function returns NA values in some unexpected cases.

> library(textclean)
> replace_word_elongation("ooo")
[1] "o"
> replace_word_elongation("Ooo")
[1] NA
> replace_word_elongation("Oo")
[1] "Oo"
> replace_word_elongation("oOo")
[1] NA

Another example:

> replace_word_elongation("guinnesss⁠")
[1] NA
> replace_word_elongation("Guinnesss⁠")
[1] NA
> replace_word_elongation("Guinness⁠")
[1] "Guinness⁠"

I have also observed a similar issue with the string "bbbb" but couldn't always replicate it.

> replace_word_elongation("bbbb⁠")
[1] NA
> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] textclean_0.9.5

loaded via a namespace (and not attached):
[1] compiler_4.1.1    qdapRegex_0.7.2   tools_4.1.1       glue_1.4.2        stringi_1.7.5     data.table_1.14.2

Thank you for the report. I’ll look into it soon.

There are 2 problems being described above:

  1. A mixed case problem
  2. A non-ascii problem for the guinness example

The mixed case problem has been fixed in version 0.9.6

The guinnesss example has something else going on. It has non-ascii characters that are preventing the elongation fix from happening:

> grepl('[^ -~]', "guinnesss⁠")
[1] TRUE
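
To see which characters trigger that check, the string's code points can be listed directly. This is a minimal base-R sketch: non_ascii_codepoints is a hypothetical helper, and "\u2060" (WORD JOINER) is only an assumed stand-in for the invisible character that may have been pasted along with the text.

```r
# Hypothetical helper: report code points outside printable ASCII (space to ~).
non_ascii_codepoints <- function(x) {
  cps <- utf8ToInt(enc2utf8(x))        # integer code points of a single string
  bad <- cps[cps < 32L | cps > 126L]   # anything outside the ' ' to '~' range
  sprintf("U+%04X", bad)
}

# Assumption: an invisible word joiner follows the visible letters.
x <- "guinnesss\u2060"

n_chars <- nchar(x)                  # counts the invisible character as well
offenders <- non_ascii_codepoints(x)
print(n_chars)
print(offenders)
```

If the count is one higher than the number of visible letters, an invisible character is most likely attached to the string.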

Currently, the latter is ignored because it doesn't meet the replacement criteria*:

replace_word_elongation(c("Hellloooo", "guinnesss⁠"))
[1] "hello"     "guinnesss⁠"

*Note: elongation.pattern ("(?i)(^|\\b)\\w*([a-z])(\\1{2,})\\w*($|\\b)") has been pulled out into its own argument and is documented as:

#' @param elongation.pattern The elongation pattern to search for. The default
#' only considers a repeat of \code{'[A-Za-z]'} within a "word" that is bounded
#' by a word boundary or the beginning or end of the string and contains only
#' \code{'\w'} characters. This means "words" with non-ascii characters will
#' not be considered.

PS The 'bbbb' pattern you show above also contains non-ascii characters grepl('[^ -~]', "bbbb⁠")
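
One possible workaround, sketched here under the assumption that the stray characters carry no meaning and can simply be dropped, is to strip everything outside printable ASCII before cleaning. The strip_non_ascii helper is hypothetical and uses base R only; it would also remove legitimate accented letters, so it is not suitable for every corpus.

```r
# Hypothetical helper: drop every character outside the ' ' to '~' range.
strip_non_ascii <- function(x) {
  gsub("[^ -~]", "", enc2utf8(x), perl = TRUE)
}

# Assumption: "\u2060" stands in for the invisible character in the report.
dirty <- c("bbbb\u2060", "guinnesss\u2060")
clean <- strip_non_ascii(dirty)
print(clean)

# The cleaned vector could then be passed on, e.g.:
# textclean::replace_word_elongation(clean)
```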

Thank you so much for the reply above. It is not entirely clear to me why these strings test TRUE for non-ASCII characters.

Regarding the note above, I was wondering whether the pattern should be "(?i)(^|\\b)\\w*([a-z])(\\2{2,})\\w*($|\\b)" rather than "(?i)(^|\\b)\\w*([a-z])(\\1{2,})\\w*($|\\b)". The first capturing group is (^|\\b), whereas I guess the back-reference should point to the second capturing group, ([a-z]). Or am I misreading it?

> grepl("(^|\\b)\\w*([a-z])(\\1{2,})\\w*($|\\b)", "test")
[1] TRUE
> grepl("(^|\\b)\\w*([a-z])(\\2{2,})\\w*($|\\b)", "test")
[1] FALSE
> grepl("(^|\\b)\\w*([a-z])(\\1{2,})\\w*($|\\b)", "tessst")
[1] TRUE
> grepl("(^|\\b)\\w*([a-z])(\\2{2,})\\w*($|\\b)", "tessst")
[1] TRUE
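
The difference can be checked over several words at once. This is a small sketch rather than the package's own test: the two patterns are copied from the calls above, and perl = TRUE is an assumption made here so that PCRE back-reference rules apply. Because group 1, (^|\\b), only ever captures an empty string, \\1{2,} matches emptiness and the old pattern accepts any word containing a letter.

```r
# Patterns copied from the comment above; only the back-reference differs.
pat_old <- "(^|\\b)\\w*([a-z])(\\1{2,})\\w*($|\\b)"
pat_new <- "(^|\\b)\\w*([a-z])(\\2{2,})\\w*($|\\b)"

words <- c("test", "tessst", "hello")

# Assumption: perl = TRUE, so the PCRE engine is used for both patterns.
old_hits <- grepl(pat_old, words, perl = TRUE)  # group 1 is empty, so all match
new_hits <- grepl(pat_new, words, perl = TRUE)  # needs a letter repeated 3+ times

print(data.frame(word = words, old = old_hits, new = new_hits))
```

With the second back-reference, only "tessst" is flagged as elongated.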

Fixed, thank you.