Error in `[.data.table` when using special characters
etiennebacher opened this issue · 2 comments
Hello,
I may have found a bug that was introduced in version 0.8.6 (the latest version on CRAN at the time of writing). Text containing special characters triggers the following error:
library(udpipe)
library(tm)
# Text data
textData <- data.frame(
  doc_id = 1,
  text = "tradução"
)
# Download and load model
udModel <- udpipe_download_model(language = "portuguese-gsd",
                                 model_dir = getwd())
udModel <- udpipe_load_model('portuguese-gsd-ud-2.5-191206.udpipe')
# Make a corpus
textCorp <- VCorpus(DataframeSource(textData))
text <- lapply(textCorp, content)
text <- data.frame(doc_id = 1:nrow(textData),
                   text = unlist(text))
udpipe(text, object = udModel)
Error in `[.data.table`(out, , `:=`(term_id, 1L:.N), by = list(doc_id)) :
Supplied 2 items to be assigned to group 1 of size 0 in column 'term_id'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code.
In addition: Warning message:
In strsplit(x$conllu, "\n", fixed = TRUE) : input string 1 is invalid UTF-8
The error is triggered by the letters "çã" in the text (removing them makes the error disappear). I think it originates from the following line in the source code:
Line 254 in fdcc4cc
Removing fixed = TRUE in the line above removes the error. In case it helps, fixed = TRUE was introduced in c7557b6.
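As a side note on the wording of the error message, here is a minimal base R sketch of why "2 items" could be supplied to a group "of size 0". It assumes the group being assigned really has no rows, which is what the message suggests; `n` is a stand-in for data.table's `.N` and is not taken from the udpipe source.

```r
# n stands in for data.table's .N inside a group that has no rows
# (an assumption based on the "group 1 of size 0" part of the message).
n <- 0L

# The colon operator counts down when the end is below the start,
# so 1L:0L has two elements rather than none.
colon_ids <- 1L:n

# seq_len() returns an empty integer vector for a length of zero.
safe_ids <- seq_len(n)

print(colon_ids)
print(safe_ids)
```

Under that assumption, the data.table error would be a symptom of the parsed output being empty, not the root cause.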
Session info
- Session info ---------------------------------------------------------
setting  value
version  R version 4.1.0 (2021-05-18)
os       Windows 10 x64
system   x86_64, mingw32
ui       RStudio
language (EN)
collate  French_France.1252
ctype    French_France.1252
tz       Europe/Paris
date     2021-10-18
- Packages -------------------------------------------------------------
package     * version date       lib source
cli           3.0.1   2021-07-17 [1] CRAN (R 4.1.1)
data.table    1.14.2  2021-09-27 [1] standard (@1.14.2)
lattice       0.20-45 2021-09-22 [1] CRAN (R 4.1.1)
Matrix        1.3-4   2021-06-01 [1] CRAN (R 4.1.0)
NLP         * 0.2-1   2020-10-14 [1] standard (@0.2-1)
Rcpp          1.0.7   2021-07-07 [1] standard (@1.0.7)
rstudioapi    0.13    2020-11-12 [1] standard (@0.13)
sessioninfo   1.1.1   2018-11-05 [1] standard (@1.1.1)
slam          0.1-48  2020-12-03 [1] standard (@0.1-48)
tm          * 0.7-8   2020-11-18 [1] standard (@0.7-8)
udpipe      * 0.8.6   2021-06-01 [1] standard (@0.8.6)
withr         2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
xml2          1.3.2   2020-04-23 [1] CRAN (R 4.1.0)
[1] C:/Users/etienne/Documents/R/R-4.1.0/library
Best,
What happens if you put your text in UTF-8 encoding, as indicated in the help?
Indeed, using text = enc2utf8("tradução") works. Thanks, and sorry for the inconvenience.
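For later readers, here is a small base R sketch of what the conversion changes. It builds the string from latin1 bytes so that it behaves the same in any locale; that is an assumption about how the text was stored under the French_France.1252 locale, not something confirmed in this thread. The variable names are illustrative only.

```r
# Bytes of "tradução" in latin1 (Windows-1252 agrees for these characters).
# Building the string from raw bytes keeps the example locale-independent.
latin1_bytes <- as.raw(c(0x74, 0x72, 0x61, 0x64, 0x75, 0xe7, 0xe3, 0x6f))
x <- rawToChar(latin1_bytes)
Encoding(x) <- "latin1"

# validUTF8() looks at the bytes only and ignores the encoding mark.
valid_before <- validUTF8(x)

# enc2utf8() re-encodes the string, honouring the latin1 mark.
y <- enc2utf8(x)
valid_after <- validUTF8(y)

print(c(before = valid_before, after = valid_after))

# Assumed next step, with udModel loaded as in the report:
# udpipe(data.frame(doc_id = 1, text = y), object = udModel)
```

Converting the whole text column with enc2utf8() before calling udpipe() should have the same effect as wrapping the single string.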