Force tagger to not split words made up of numbers and letters
sanchez5674 opened this issue · 1 comments
sanchez5674 commented
Hi,
Is there a way to prevent udpipe from breaking up names made up of numbers and letters? I have sentences that contain company names like 3DS and the POS tagger separates the name into 3: NUM and DS: NOUN.
Thanks for the help.
Carlos
jwijffels commented
If the built-in tokeniser does not split the text the way you want, you can tokenise the text yourself and have udpipe skip its own tokenisation. This is shown in
Section 'My text data is already tokenised'
https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-annotation.html
Just put your tokens in a list (e.g. with strsplit), paste them together, and specify tokenizer = "vertical" (one token per line) or tokenizer = "horizontal" (one sentence per line, tokens separated by spaces) in udpipe_annotate.
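
For reference, a minimal sketch of what this could look like, following the pattern in that vignette section. The example sentence, the whitespace-only splitting rule, and the choice of the English model are assumptions made for illustration; use whatever splitting rule keeps your names intact.

```r
library(udpipe)

# Assumption: the English model. Any other udpipe model is loaded the same way.
dl <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(file = dl$file_model)

# Hypothetical example sentence containing a mixed letter/number name.
text <- "3DS acquired the company last year"

# Tokenise yourself: here simply on whitespace, so "3DS" stays one token.
tokens <- strsplit(text, split = "[[:space:]]+")[[1]]

# The "vertical" format expects one token per line.
txt <- paste(tokens, collapse = "\n")

# Skip udpipe's own tokeniser and tag the tokens as given.
x <- udpipe_annotate(ud_model, x = txt, tokenizer = "vertical")
x <- as.data.frame(x)
x[, c("token", "upos")]
```

This only guarantees that 3DS is kept as a single token; which tag the model assigns to it still depends on the model.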