Note on SpacesAfter
jwijffels opened this issue · 5 comments
This comes from the UDPipe user's manual: https://ufal.mff.cuni.cz/udpipe/users-manual
Basically, this means the misc field can contain SpacesBefore=, SpacesAfter= and SpacesInToken= entries,
whose content is escaped with the following sequences:
- \s: space
- \t: tab
- \r: CR character
- \n: LF character
- \p: | (pipe character)
- \\: \ (backslash character)
You can see that in, for example:
> library(udpipe)
> x <- udpipe(" .It remains all spaces. You see\n\n\n. We started a new paragraph.", "english")
> x[, c("doc_id", "paragraph_id", "sentence_id", "term_id", "token", "misc")]
   doc_id paragraph_id sentence_id term_id     token                           misc
 1   doc1            1           1       1         . SpacesBefore=\\s|SpaceAfter=No
 2   doc1            1           2       2        It                           <NA>
 3   doc1            1           2       3   remains                           <NA>
 4   doc1            1           2       4       all                           <NA>
 5   doc1            1           2       5    spaces                  SpaceAfter=No
 6   doc1            1           2       6         .                           <NA>
 7   doc1            1           3       7       You                           <NA>
 8   doc1            1           3       8       see          SpacesAfter=\\n\\n\\n
 9   doc1            2           4       9         .                           <NA>
10   doc1            2           5      10        We                           <NA>
11   doc1            2           5      11   started                           <NA>
12   doc1            2           5      12         a                           <NA>
13   doc1            2           5      13       new                           <NA>
14   doc1            2           5      14 paragraph                  SpaceAfter=No
15   doc1            2           5      15         .                SpacesAfter=\\n
The last row is an exception: it is caused by a bug in the R package I maintain (bnosac/udpipe#27), which I still need to fix.
By default, UDPipe uses custom MISC fields to store all spaces in the original document. This markup is backward compatible with the CoNLL-U v2 SpaceAfter=No feature. This markup can be utilized by the plaintext output format, which allows reconstructing the original document.
Note that in theory not only spaces, but also other original content can be saved in this way (for example XML tags if the input was encoded in an XML file).
The markup uses the following MISC fields on tokens (not words in multi-word tokens):
- SpacesBefore=content (by default empty): spaces/other content preceding the token
- SpacesAfter=content (by default a space if the SpaceAfter=No feature is not present, empty otherwise): spaces/other content following the token
- SpacesInToken=content (by default equal to the FORM of the token): FORM of the token including original spaces (this is needed only if tokens are allowed to contain spaces and a token contains a tab or newline character)
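To illustrate how these fields sit in the misc column, here is a minimal R sketch that splits one MISC string into key/value pairs. The helper name parse_misc is made up for this example and is not part of the udpipe package. It assumes fields are separated by | and that a literal pipe inside a value is always escaped as \p, so splitting on | is safe.

```r
# Hypothetical helper (not part of udpipe): split a MISC string such as
# "SpacesBefore=\\s|SpaceAfter=No" into a named character vector.
parse_misc <- function(misc) {
  # Missing or empty MISC (NA, "" or the CoNLL-U placeholder "_") has no fields.
  if (is.na(misc) || misc == "" || misc == "_") {
    return(character(0))
  }
  # A literal pipe inside a value is escaped as \p, so "|" only separates fields.
  fields <- strsplit(misc, "|", fixed = TRUE)[[1]]
  # Split each field at its first "=" only; as.integer() drops regexpr attributes.
  pos <- as.integer(regexpr("=", fields, fixed = TRUE))
  keys <- ifelse(pos > 0, substr(fields, 1, pos - 1), fields)
  values <- ifelse(pos > 0, substr(fields, pos + 1, nchar(fields)), NA_character_)
  names(values) <- keys
  values
}

# Values stay escaped at this point (the R string "\\s" is backslash + s).
parse_misc("SpacesBefore=\\s|SpaceAfter=No")
```

The values are returned still escaped; decoding them is a separate step, so that a key lookup never has to deal with real tabs or newlines.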
The content of all the three fields must be escaped to allow storing tabs and newlines. The following C-like schema is used:
- \s: space
- \t: tab
- \r: CR character
- \n: LF character
- \p: | (pipe character)
- \\: \ (backslash character)
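The escape schema above can be decoded with a small left-to-right scan. The following is only a sketch: unescape_spaces is a hypothetical helper name, not a udpipe function, and leaving unknown sequences untouched is my own assumption, since the manual does not say how to treat them. It scans character by character rather than chaining gsub() calls, so that an escaped backslash followed by an s is not mistaken for \s.

```r
# Hypothetical helper (not part of udpipe): decode the escapes used in
# SpacesBefore / SpacesAfter / SpacesInToken for a single string.
unescape_spaces <- function(x) {
  map <- c(s = " ", t = "\t", r = "\r", n = "\n", p = "|", "\\" = "\\")
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  out <- character(0)
  i <- 1
  while (i <= length(chars)) {
    known <- chars[i] == "\\" && i < length(chars) &&
      chars[i + 1] %in% names(map)
    if (known) {
      out <- c(out, map[[chars[i + 1]]])  # consume backslash + code letter
      i <- i + 2
    } else {
      out <- c(out, chars[i])             # assumption: keep anything else as-is
      i <- i + 1
    }
  }
  paste(out, collapse = "")
}

# In R source, "\\n" is the two characters backslash + n, as stored in misc.
unescape_spaces("\\n\\n\\n") == "\n\n\n"
unescape_spaces("\\s") == " "
```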
This was just to inform you.
FYI: this function reconstructs the text from a udpipe-tokenised dataset: https://github.com/bnosac/udpipe/blob/master/R/udpipe_reconstruct.R
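To make the idea concrete, here is a simplified sketch of such a reconstruction. It is not the code in udpipe_reconstruct.R. The helper names (decode, misc_value, rebuild_text) are hypothetical, and the sketch only applies the defaults from the manual: empty SpacesBefore, a single space after the token unless SpaceAfter=No is set, and the token itself unless SpacesInToken is given.

```r
# Sketch only; all helper names are hypothetical, not udpipe functions.
# Decode \s \t \r \n \p \\ with a left-to-right scan.
decode <- function(x) {
  map <- c(s = " ", t = "\t", r = "\r", n = "\n", p = "|", "\\" = "\\")
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  out <- character(0)
  i <- 1
  while (i <= length(chars)) {
    known <- chars[i] == "\\" && i < length(chars) &&
      chars[i + 1] %in% names(map)
    out <- c(out, if (known) map[[chars[i + 1]]] else chars[i])
    i <- i + if (known) 2 else 1
  }
  paste(out, collapse = "")
}

# Return the (still escaped) value of one key in a MISC string, or NA.
misc_value <- function(misc, key) {
  if (is.na(misc)) return(NA_character_)
  fields <- strsplit(misc, "|", fixed = TRUE)[[1]]
  prefix <- paste0(key, "=")
  hit <- fields[startsWith(fields, prefix)]
  if (length(hit) == 0) NA_character_ else substring(hit[1], nchar(prefix) + 1)
}

# Concatenate SpacesBefore + token + SpacesAfter for every token.
rebuild_text <- function(token, misc) {
  pieces <- mapply(function(tok, m) {
    before <- misc_value(m, "SpacesBefore")
    after <- misc_value(m, "SpacesAfter")
    in_tok <- misc_value(m, "SpacesInToken")
    no_space <- identical(misc_value(m, "SpaceAfter"), "No")
    paste0(
      if (is.na(before)) "" else decode(before),
      if (is.na(in_tok)) tok else decode(in_tok),
      if (!is.na(after)) decode(after) else if (no_space) "" else " "
    )
  }, token, misc, USE.NAMES = FALSE)
  paste(pieces, collapse = "")
}

# First eight rows of the example output shown earlier in this thread.
token <- c(".", "It", "remains", "all", "spaces", ".", "You", "see")
misc <- c("SpacesBefore=\\s|SpaceAfter=No", NA, NA, NA,
          "SpaceAfter=No", NA, NA, "SpacesAfter=\\n\\n\\n")
rebuild_text(token, misc)
# " .It remains all spaces. You see\n\n\n"
```

This version works on a single document at a time; with several documents you would apply it per doc_id.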
Oh, that's actually even better (reconstructing the text is what I really always want anyway)! Thanks again!
I needed to push out the updated 3.0.0 ahead of a workshop next week, but will be working on more minor revisions for 3.0.1; I'll probably just return something similar to the text_with_ws that spaCy yields.
Good luck with the workshop.