Encoding issue in training data
jheinecke opened this issue · 2 comments
Hi,

I've stumbled on a strange problem reading the files downloaded with `gsutil -m cp -R gs://natural_questions/v1.0 <mydir>`. I try to find the answers using the `start_byte` and `end_byte` positions of the tokens in `document_html`. Tokens with low `start_byte`/`end_byte` values are correct, but later in the document the positions are wrong. The following Python 3 script shows the error:
```python
import json

fp = open("nq-train-00.jsonl", encoding="utf-8")
line = fp.readline()
j = json.loads(line)
for toks in j["document_tokens"]:
    print("token: {%s}\t%d\t%d" % (toks["token"], toks["start_byte"], toks["end_byte"]))
    print(" in text: {%s}" % (j["document_html"][toks["start_byte"]:toks["end_byte"]]))
    print()
```
At the beginning this produces a correct correspondence between the tokens in `document_tokens` and the text:
```
token: {The}	92	95
 in text: {The}
token: {Walking}	96	103
 in text: {Walking}
token: {Dead}	104	108
 in text: {Dead}
```
but later on, notably after a non-breaking space (U+00A0) in `document_html`, things get weird:
```
token: {season}	53862	53868
 in text: {season}
token: {8}	53870	53871
 in text: {)}
token: {)}	53871	53872
 in text: {<}
token: {</Th>}	53872	53877
 in text: {/TH> }
token: {</Tr>}	53878	53883
 in text: {/TR> }
```
It looks as if the `start_byte`/`end_byte` values are shifted. The same happens with em dashes (U+2014), arrows (←), and similar non-ASCII characters. Is there a corrected version available, or is there a list of characters which have been replaced by a sequence of characters before calculating `start_byte`/`end_byte`?
I'm having the same issue; does anyone have any advice?
For posterity: you need to encode the string into bytes. In the case of @jheinecke's original code, `j["document_html"]` should become `j["document_html"].encode("utf-8")`.
To be clear, you have to decode the JSON to a dict, extract `"document_html"`, and then re-encode it in UTF-8. If you try to skip the JSON part, e.g., by directly grabbing whatever follows `b'"document_html":"'` in the raw jsonl, then you would have to handle the JSON string escapes yourself (so, don't do that).
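To see the drift concretely, here is a small illustration (a made-up string, not from the dataset) of how a single multi-byte character pushes character indices and byte offsets apart:

```python
s = "season\u00a08"            # "season", a non-breaking space, then "8"
b = s.encode("utf-8")
print(len(s), len(b))          # 8 characters, 9 bytes (U+00A0 is two bytes)
print(s[7], b[8:9])            # '8' b'8' -- the byte offset is one ahead
```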
Another thing to keep in mind is that some context-dependent canonicalization happens going from bytes to tokens, so the byte span does not always reproduce the token text exactly (note `</Th>` in `document_tokens` versus `</TH>` in the HTML output above).
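So even with correct byte slicing, an exact string comparison against the token can fail for such tokens. A quick way to spot them, reusing `j` and `html_bytes` from the sketch above:

```python
# Tokens whose byte span differs from the token text, e.g. </TH> vs </Th>.
mismatches = [
    t["token"]
    for t in j["document_tokens"]
    if html_bytes[t["start_byte"]:t["end_byte"]].decode("utf-8") != t["token"]
]
print(len(mismatches), mismatches[:10])
```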