Unexpected behaviour, replace_word_elongation returning NAs
sdesabbata opened this issue · 4 comments
The replace_word_elongation function returns NA values in some unexpected cases.
> library(textclean)
> replace_word_elongation("ooo")
[1] "o"
> replace_word_elongation("Ooo")
[1] NA
> replace_word_elongation("Oo")
[1] "Oo"
> replace_word_elongation("oOo")
[1] NA
Another example:
> replace_word_elongation("guinnesss")
[1] NA
> replace_word_elongation("Guinnesss")
[1] NA
> replace_word_elongation("Guinness")
[1] "Guinness"
I have also observed a similar issue with the string "bbbb", but couldn't always replicate it.
> replace_word_elongation("bbbb")
[1] NA
> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] textclean_0.9.5
loaded via a namespace (and not attached):
[1] compiler_4.1.1 qdapRegex_0.7.2 tools_4.1.1 glue_1.4.2 stringi_1.7.5 data.table_1.14.2
Thank you for the report. I’ll look into it soon.
There are two problems described above:
- A mixed-case problem
- A non-ASCII problem for the guinnesss example
The mixed-case problem has been fixed in version 0.9.6.
The guinnesss example has something else going on. It has non-ascii characters that are preventing the elongation fix from happening:
> grepl('[^ -~]', "guinnesss")
[1] TRUE
Currently, the latter is ignored because it doesn't meet the replacement criteria*:
replace_word_elongation(c("Hellloooo", "guinnesss"))
[1] "hello" "guinnesss"
*Note: elongation.pattern ("(?i)(^|\\b)\\w*([a-z])(\\1{2,})\\w*($|\\b)") has been pulled out and is defined as:
#' @param elongation.pattern The elongation pattern to search for. The default
#' only considers a repeat of \code{'[A-Za-z]'} within a "word" that is bounded
#' by a word boundary or the beginning or end of the string and contains only
#' \code{'\w'} characters. This means "words" with non-ascii characters will
#' not be considered.
PS: The 'bbbb' string you show above also contains non-ASCII characters; you can check with grepl('[^ -~]', "bbbb").
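If it helps to see which characters trip the '[^ -~]' check, one option is to list the code points that fall outside the printable ASCII range. This is only a minimal base-R sketch. The actual hidden character in the reported strings is not known, so the input below is a made-up stand-in: "guinnesss" with a zero-width space (U+200B) appended.

```r
# Hypothetical input: the hidden character in the original report is unknown,
# so a zero-width space (U+200B) is appended here purely for illustration.
x <- "guinnesss\u200b"

# Same check as in the comment above: anything outside space..tilde.
has_non_ascii <- grepl("[^ -~]", x, perl = TRUE)

# Convert to Unicode code points and keep those outside printable ASCII
# (32..126), mirroring the character class '[^ -~]'.
cp <- utf8ToInt(x)
flagged <- cp[cp < 32 | cp > 126]
flagged_labels <- sprintf("U+%04X", flagged)

# One possible pre-cleaning step: drop the flagged characters.
cleaned <- gsub("[^ -~]", "", x, perl = TRUE)

print(has_non_ascii)
print(flagged_labels)
print(cleaned)
```

The cleaned string could then be passed to replace_word_elongation; the result is not shown here because it depends on the installed textclean version.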
Thank you so much for the reply above. I am not entirely clear why the strings return true for non-ascii characters.
Regarding the note above, I was just wondering whether the pattern should be "(?i)(^|\\b)\\w*([a-z])(\\2{2,})\\w*($|\\b)" rather than "(?i)(^|\\b)\\w*([a-z])(\\1{2,})\\w*($|\\b)", as the first capturing group is (^|\\b), whereas I guess the backreference should refer to the second capturing group ([a-z]) - or am I misreading it?
> grepl("(^|\\b)\\w*([a-z])(\\1{2,})\\w*($|\\b)", "test")
[1] TRUE
> grepl("(^|\\b)\\w*([a-z])(\\2{2,})\\w*($|\\b)", "test")
[1] FALSE
> grepl("(^|\\b)\\w*([a-z])(\\1{2,})\\w*($|\\b)", "tessst")
[1] TRUE
> grepl("(^|\\b)\\w*([a-z])(\\2{2,})\\w*($|\\b)", "tessst")
[1] TRUE
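To make the group numbering concrete, the captures can be printed with regexec and regmatches. This is a small base-R sketch, not the package's code; the variable names pat1, pat2 and m are only for illustration, and perl = TRUE is assumed so that the (?i) flag is honoured.

```r
# pat1 backreferences group 1, i.e. (^|\\b), which only ever captures "".
# pat2 backreferences group 2, i.e. ([a-z]), the letter being repeated.
pat1 <- "(?i)(^|\\b)\\w*([a-z])(\\1{2,})\\w*($|\\b)"
pat2 <- "(?i)(^|\\b)\\w*([a-z])(\\2{2,})\\w*($|\\b)"

# regexec returns the full match followed by one entry per capturing group.
m <- regmatches("tessst", regexec(pat2, "tessst", perl = TRUE))[[1]]
names(m) <- c("match", "group1", "group2", "group3", "group4")
print(m)

# With \\1 the "repeat" is a repeat of the empty string, so any word with a
# letter matches; with \\2 only a genuinely elongated word matches.
pat1_matches_test <- grepl(pat1, "test", perl = TRUE)
pat2_matches_test <- grepl(pat2, "test", perl = TRUE)
print(c(pat1 = pat1_matches_test, pat2 = pat2_matches_test))
```

Here group2 holds the repeated letter and group3 holds its extra copies, which is why the backreference needs to point at group 2.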
Fixed, thank you.