bnosac/udpipe

Force tagger to not split words made up of numbers and letters

sanchez5674 opened this issue · 1 comments

Hi,

Is there a way to prevent udpipe from breaking up names made up of numbers and letters? I have sentences that contain company names like 3DS, and the POS tagger splits the name into 3 (NUM) and DS (NOUN).

Thanks for the help.

Carlos

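If the built-in tokeniser does not split the text the way you want, you can tokenise it yourself and let udpipe do only the tagging and parsing. This is shown in the
section 'My text data is already tokenised' of the annotation vignette:
https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-annotation.html
Put your tokens in a list (for example with strsplit), paste them back into one string per document, and specify tokenizer = "vertical" (one token per line) or tokenizer = "horizontal" (tokens separated by spaces) in udpipe_annotate.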
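
For the 3DS case, a minimal sketch of the vertical approach could look like the code below. The example sentence, the whitespace-only split, and the helper names to_vertical and annotate_pretokenised are assumptions made for illustration and are not part of udpipe. The model download is left commented out because it needs network access.

```r
# Example input (assumed): punctuation is already separated by spaces
sentences <- c(doc1 = "We bought shares of 3DS last year .")

# Naive tokeniser for illustration: split on whitespace only, so "3DS" stays whole
tokens <- strsplit(sentences, split = "[[:space:]]+")

# tokenizer = "vertical" expects one token per line, so join each document's tokens with newlines
to_vertical <- function(tokens) sapply(tokens, paste, collapse = "\n")
txt <- to_vertical(tokens)

# Hypothetical helper: tag and parse pre-tokenised text with an already loaded udpipe model
annotate_pretokenised <- function(model, txt) {
  x <- udpipe::udpipe_annotate(model, x = txt, doc_id = names(txt),
                               tokenizer = "vertical")
  as.data.frame(x)
}

# Usage (downloads an English model, so it is not run here):
# dl <- udpipe::udpipe_download_model(language = "english")
# model <- udpipe::udpipe_load_model(dl$file_model)
# annotate_pretokenised(model, txt)[, c("token", "upos")]
```

Since the tokens are supplied as given, 3DS should come back as a single row in the result. Which POS tag it receives still depends on the model.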