note on spacesafter

Question

jwijffels opened this issue 5 years ago · 5 comments

From the UDPipe user's manual: https://ufal.mff.cuni.cz/udpipe/users-manual

Basically, this means the misc field can contain SpacesBefore=, SpacesAfter= and SpacesInToken= entries,
whose content uses the following escape sequences:

\s: space
\t: tab
\r: CR character
\n: LF character
\p: | (pipe character)
\\: \ (backslash character)

You can see this in, for example:

> library(udpipe)
> x <- udpipe(" .It remains all spaces. You see\n\n\n. We started a new paragraph.", "english")
> x[, c("doc_id", "paragraph_id", "sentence_id", "term_id", "token", "misc")]
   doc_id paragraph_id sentence_id term_id     token                           misc
1    doc1            1           1       1         . SpacesBefore=\\s|SpaceAfter=No
2    doc1            1           2       2        It                           <NA>
3    doc1            1           2       3   remains                           <NA>
4    doc1            1           2       4       all                           <NA>
5    doc1            1           2       5    spaces                  SpaceAfter=No
6    doc1            1           2       6         .                           <NA>
7    doc1            1           3       7       You                           <NA>
8    doc1            1           3       8       see          SpacesAfter=\\n\\n\\n
9    doc1            2           4       9         .                           <NA>
10   doc1            2           5      10        We                           <NA>
11   doc1            2           5      11   started                           <NA>
12   doc1            2           5      12         a                           <NA>
13   doc1            2           5      13       new                           <NA>
14   doc1            2           5      14 paragraph                  SpaceAfter=No
15   doc1            2           5      15         .                SpacesAfter=\\n

The exception is the last row: that value is caused by a bug in the R package I maintain (bnosac/udpipe#27), which I still need to fix.

By default, UDPipe uses custom MISC fields to store all spaces in the original document. This markup is backward compatible with CoNLL-U v2 SpaceAfter=No feature. This markup can be utilized by the plaintext output format, which allows reconstructing the original document.

Note that in theory not only spaces, but also other original content can be saved in this way (for example XML tags if the input was encoded in an XML file).

The markup uses the following MISC fields on tokens (not words in multi-word tokens):

SpacesBefore=content (by default empty): spaces/other content preceding the token
SpacesAfter=content (by default a space if SpaceAfter=No feature is not present, empty otherwise): spaces/other content following the token
SpacesInToken=content (by default equal to the FORM of the token): FORM of the token including original spaces (this is needed only if tokens are allowed to contain spaces and a token contains a tab or newline characters)

The content of all three fields must be escaped to allow storing tabs and newlines. The following C-like schema is used:

\s: space
\t: tab
\r: CR character
\n: LF character
\p: | (pipe character)
\\: \ (backslash character)
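
For anyone who wants to decode these values in R, here is a minimal sketch. `unescape_spaces()` is a hypothetical helper written for illustration, not a function from the udpipe package. It scans left to right instead of chaining `gsub()` calls, so an escaped backslash followed by `s` is not mistaken for an escaped space. It also assumes the doubled backslashes in the printed data frame above are just R's print escaping, so the stored value is a single backslash plus a letter.

```r
# Hypothetical helper (not part of udpipe): decode the escape schema listed above.
unescape_spaces <- function(x) {
  # Letter after the backslash -> literal character it stands for.
  map <- c("s" = " ", "t" = "\t", "r" = "\r", "n" = "\n", "p" = "|", "\\" = "\\")
  vapply(x, function(one) {
    if (is.na(one)) return(NA_character_)
    chars <- strsplit(one, "", fixed = TRUE)[[1]]
    out <- character(0)
    i <- 1
    while (i <= length(chars)) {
      is_escape <- chars[i] == "\\" && i < length(chars) &&
        chars[i + 1] %in% names(map)
      if (is_escape) {
        # Consume the backslash and the letter together.
        out <- c(out, map[[chars[i + 1]]])
        i <- i + 2
      } else {
        # Anything else is copied through unchanged.
        out <- c(out, chars[i])
        i <- i + 1
      }
    }
    paste(out, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

decoded <- unescape_spaces(c("\\n\\n\\n", "\\s", NA))
```

Here `decoded` holds three newlines, a single space, and `NA`. Unknown escape sequences are passed through unchanged, which is a design choice of this sketch and not something the manual specifies.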

Answer 1 · 2019-10-22T12:57:12.000Z

This was just to inform you.

Answer 2 · 2019-10-22T21:25:33.000Z

FYI. This function reconstructs the text from a udpipe tokenised dataset https://github.com/bnosac/udpipe/blob/master/R/udpipe_reconstruct.R
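
To illustrate the idea without relying on the linked function's exact interface, here is a rough base-R sketch that applies the defaults described in the manual (SpacesBefore empty, SpacesAfter a single space unless SpaceAfter=No, SpacesInToken equal to the token). The helpers `unescape()`, `misc_value()` and `reconstruct_text()` are hypothetical names made up for this example; this is not the code in udpipe_reconstruct.R. The token and misc vectors are typed in by hand from the output above.

```r
# Hypothetical helper: decode \s \t \r \n \p \\ by scanning left to right.
unescape <- function(s) {
  map <- c("s" = " ", "t" = "\t", "r" = "\r", "n" = "\n", "p" = "|", "\\" = "\\")
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  out <- character(0)
  i <- 1
  while (i <= length(chars)) {
    if (chars[i] == "\\" && i < length(chars) && chars[i + 1] %in% names(map)) {
      out <- c(out, map[[chars[i + 1]]])
      i <- i + 2
    } else {
      out <- c(out, chars[i])
      i <- i + 1
    }
  }
  paste(out, collapse = "")
}

# Hypothetical helper: pull one key out of a "key=value|key=value" misc string.
misc_value <- function(misc, key) {
  if (is.na(misc)) return(NA_character_)
  parts <- strsplit(misc, "|", fixed = TRUE)[[1]]
  hit <- parts[startsWith(parts, paste0(key, "="))]
  if (length(hit) == 0) return(NA_character_)
  substring(hit[1], nchar(key) + 2)
}

# Hypothetical helper: glue tokens back together using the manual's defaults.
reconstruct_text <- function(token, misc) {
  pieces <- mapply(function(tok, m) {
    before   <- misc_value(m, "SpacesBefore")
    after    <- misc_value(m, "SpacesAfter")
    inside   <- misc_value(m, "SpacesInToken")
    no_space <- identical(misc_value(m, "SpaceAfter"), "No")
    before <- if (is.na(before)) "" else unescape(before)
    after  <- if (!is.na(after)) unescape(after) else if (no_space) "" else " "
    form   <- if (is.na(inside)) tok else unescape(inside)
    paste0(before, form, after)
  }, token, misc, USE.NAMES = FALSE)
  paste(pieces, collapse = "")
}

# Token and misc columns copied by hand from the example output above.
token <- c(".", "It", "remains", "all", "spaces", ".", "You", "see", ".",
           "We", "started", "a", "new", "paragraph", ".")
misc <- c("SpacesBefore=\\s|SpaceAfter=No", NA, NA, NA, "SpaceAfter=No", NA, NA,
          "SpacesAfter=\\n\\n\\n", NA, NA, NA, NA, NA, "SpaceAfter=No",
          "SpacesAfter=\\n")

reconstructed <- reconstruct_text(token, misc)
```

With these rows, `reconstructed` matches the original input string except for one trailing newline, which comes from the last row's SpacesAfter value (the one affected by bnosac/udpipe#27).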

Answer 3 · 2019-10-23T14:02:02.000Z

Oh, that's actually even better (reconstructing the text is what I really always want anyway)! Thanks again!

Answer 4 · 2019-10-23T14:03:01.000Z

I needed to push out the updated 3.0.0 ahead of a workshop next week, but I will be working on more minor revisions for 3.0.1; I'll probably just return something similar to the text_with_ws that spaCy yields.

Answer 5 · 2019-10-23T14:09:37.000Z

Good luck with the workshop.