bnosac/crfsuite

Difference in start & end position between shiny app and udpipe()

DuraQ opened this issue · 3 comments

DuraQ commented

Great package, first of all!

I'm using the Shiny app to annotate custom entities in text. I have a very simple example where I load the text and tag a monetary amount in the text:

[image: screenshot of the annotation]

I then run the exact same piece of text through udpipe for automatic POS tagging. However, I'm unable to successfully join the udpipe output with the output from the Shiny app, because there is a difference in the start and end positions.

udpipe extracts "100,000,000.00" as a token, but with start and end positions 1539-1552.

nchar(text) returns 1,839 characters, but the maximum end value in the dataframe returned by udpipe is 1911. So for some reason udpipe has 72 additional characters that are not present (or visible?) in the text.
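One way to check whether a set of start/end offsets actually lines up with a given string is to cut each token back out of the text and compare. The following is a minimal base R sketch, not udpipe's own method: the text and the `tokens` data frame are made-up stand-ins for the real input and for the token/start/end columns of the udpipe output.

```r
# Hypothetical text and hand-written offsets, standing in for the real data.
text <- "Pay 100,000,000.00 now"
tokens <- data.frame(
  token = c("Pay", "100,000,000.00", "now"),
  start = c(1L, 5L, 20L),
  end   = c(3L, 18L, 22L),
  stringsAsFactors = FALSE
)

# Cut each token back out of the text using its offsets.
tokens$recovered <- substring(text, tokens$start, tokens$end)

# TRUE where the offsets point at the expected token.
tokens$aligned <- tokens$recovered == tokens$token

# The largest end offset should never exceed the number of characters.
offsets_in_range <- max(tokens$end) <= nchar(text)

print(tokens)
print(offsets_in_range)
```

If `aligned` is FALSE for every token by the same amount, the offsets were probably computed against a slightly different string than the one being compared.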

I have verified that "text" and "text_visible" are exactly the same.
I have verified that the text is UTF-8 encoded before loading in the shiny app and before inputting into udpipe.
My versions of shiny and flexdashboard are 1.05 and 0.5.1.1, respectively.
I ran this on both Win 10 and Linux.

Any idea what might cause this?

Which version of crfsuite are you using to annotate? Is it 0.3 or a previous version?
Can you give the text that you've annotated as well as the annotation you did? If needed, send it as a private message to the maintainer listed in https://github.com/bnosac/crfsuite/blob/master/DESCRIPTION

DuraQ commented

I'm using crfsuite version 0.3. I'll send you a private message with some sample data.

This is probably caused by leading and trailing spaces in your text. Remove these before using the Shiny app and doing the annotation, as in trimws(yourtext).
Issue reported at bnosac/udpipe#54
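To illustrate the trimming advice above, here is a small base R sketch of how leading whitespace shifts character offsets. The sample string is invented for illustration, and the sketch only shows the effect of `trimws()` on positions; it does not reproduce what the Shiny app or udpipe do internally.

```r
# Hypothetical text with leading and trailing whitespace.
raw_text   <- "  \n The amount due is 100,000,000.00 USD. \n "
clean_text <- trimws(raw_text)  # strips spaces, tabs and newlines on both sides

token <- "100,000,000.00"

# Character position of the token in each version of the text.
start_raw   <- as.integer(regexpr(token, raw_text, fixed = TRUE))
start_clean <- as.integer(regexpr(token, clean_text, fixed = TRUE))

# The offsets differ by the number of leading whitespace characters removed.
shift <- start_raw - start_clean

# Offsets computed on the trimmed text recover the token from the trimmed text.
recovered <- substr(clean_text, start_clean, start_clean + nchar(token) - 1L)

print(c(raw = nchar(raw_text), clean = nchar(clean_text), shift = shift))
print(recovered)
```

Trimming once up front and feeding the same trimmed string to both the annotation app and udpipe keeps the two sets of offsets on a common basis.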