bnosac/udpipe

Error in `[.data.table` when using special characters

etiennebacher opened this issue · 2 comments

Hello,

I may have found a bug introduced in version 0.8.6 (the latest version on CRAN at the time of writing). Text containing special characters triggers the following error:

library(udpipe)
library(tm)

# Text data
textData <- data.frame(
  doc_id = 1,
  text = "tradução"
)

# Download and load model
udModel <- udpipe_download_model(language  = "portuguese-gsd", 
                                 model_dir = getwd())

udModel <- udpipe_load_model('portuguese-gsd-ud-2.5-191206.udpipe')

# Make a corpus 
textCorp <- VCorpus(DataframeSource(textData))
text     <- lapply(textCorp, content)


text <- data.frame(doc_id = 1:nrow(textData), 
                   text   = unlist(text))

udpipe(text, object = udModel)
Error in `[.data.table`(out, , `:=`(term_id, 1L:.N), by = list(doc_id)) : 
  Supplied 2 items to be assigned to group 1 of size 0 in column 'term_id'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code.
In addition: Warning message:
In strsplit(x$conllu, "\n", fixed = TRUE) : input string 1 is invalid UTF-8

The error is triggered by the characters "çã" in the text (removing them makes the error disappear). I think it originates in the following line of the source code:

txt <- strsplit(x$conllu, "\n", fixed = TRUE)[[1]]

Removing fixed = TRUE from the line above makes the error go away. In case it helps, fixed = TRUE was introduced in c7557b6.

Session info
- Session info ---------------------------------------------------------
 setting  value                       
 version  R version 4.1.0 (2021-05-18)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  French_France.1252          
 ctype    French_France.1252          
 tz       Europe/Paris                
 date     2021-10-18                  
- Packages -------------------------------------------------------------
 package     * version date       lib source
 cli           3.0.1   2021-07-17 [1] CRAN (R 4.1.1)
 data.table    1.14.2  2021-09-27 [1] standard (@1.14.2)
 lattice       0.20-45 2021-09-22 [1] CRAN (R 4.1.1)
 Matrix        1.3-4   2021-06-01 [1] CRAN (R 4.1.0)
 NLP         * 0.2-1   2020-10-14 [1] standard (@0.2-1)
 Rcpp          1.0.7   2021-07-07 [1] standard (@1.0.7)
 rstudioapi    0.13    2020-11-12 [1] standard (@0.13)
 sessioninfo   1.1.1   2018-11-05 [1] standard (@1.1.1)
 slam          0.1-48  2020-12-03 [1] standard (@0.1-48)
 tm          * 0.7-8   2020-11-18 [1] standard (@0.7-8)
 udpipe      * 0.8.6   2021-06-01 [1] standard (@0.8.6)
 withr         2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
 xml2          1.3.2   2020-04-23 [1] CRAN (R 4.1.0)

[1] C:/Users/etienne/Documents/R/R-4.1.0/library

Best,

What happens if you put your text in UTF-8 encoding, as indicated in the help?

Indeed, using text = enc2utf8("tradução") works. Thanks, and sorry for the inconvenience.
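
For anyone landing here with the same error, here is a minimal sketch of the encoding check and conversion discussed above, using base R only. It assumes the text arrives as Latin-1 / Windows-1252 bytes, which is a guess based on the French_France.1252 locale in the session info. The helper `to_utf8_text()` is a hypothetical name used for illustration and is not part of udpipe.

```r
# Bytes of "tradução" as they would be stored in Latin-1 / Windows-1252
# (assumption: this is what a CP1252 session hands over).
latin1_bytes <- as.raw(c(0x74, 0x72, 0x61, 0x64, 0x75, 0xe7, 0xe3, 0x6f))
x_latin1 <- rawToChar(latin1_bytes)
Encoding(x_latin1) <- "latin1"  # declare how the bytes should be interpreted

# The raw bytes are not valid UTF-8.
validUTF8(x_latin1)  # FALSE

# enc2utf8() re-encodes the string as UTF-8.
x_utf8 <- enc2utf8(x_latin1)
validUTF8(x_utf8)    # TRUE
Encoding(x_utf8)     # "UTF-8"

# Hypothetical helper (not a udpipe function): convert the text column of a
# data frame to UTF-8 before it is annotated.
to_utf8_text <- function(df, column = "text") {
  df[[column]] <- enc2utf8(as.character(df[[column]]))
  df
}

text_df <- data.frame(doc_id = 1, text = x_latin1, stringsAsFactors = FALSE)
text_df <- to_utf8_text(text_df)
all(validUTF8(text_df$text))  # TRUE
```

The converted data frame can then be passed to udpipe() in the same way as in the original report. Running validUTF8() on the input first is a cheap way to spot this problem before annotation.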