Encoding issue in training data
jheinecke opened this issue · 2 comments
Hi,

I've stumbled on a strange problem reading the files downloaded with `gsutil -m cp -R gs://natural_questions/v1.0 <mydir>`. I try to find the answers using the `start_byte` and `end_byte` positions of the tokens in `document_html`. Tokens with low `start_byte`/`end_byte` values are correct, but later in the document the positions are wrong. The following Python 3 script shows the error:
```python
import json

fp = open("nq-train-00.jsonl", encoding="utf-8")
line = fp.readline()
j = json.loads(line)
for toks in j["document_tokens"]:
    print("token: {%s}\t%d\t%d" % (toks["token"], toks["start_byte"], toks["end_byte"]))
    print(" in text: {%s}" % (j["document_html"][toks["start_byte"]:toks["end_byte"]]))
    print()
```
At the beginning this produces a correct correspondence between the tokens in `document_tokens` and the text:
```
token: {The}	92	95
 in text: {The}
token: {Walking}	96	103
 in text: {Walking}
token: {Dead}	104	108
 in text: {Dead}
```
but later on, notably after a non-breaking space (U+00A0) in `document_html`, things get weird:
```
token: {season}	53862	53868
 in text: {season}
token: {8}	53870	53871
 in text: {)}
token: {)}	53871	53872
 in text: {<}
token: {</Th>}	53872	53877
 in text: {/TH> }
token: {</Tr>}	53878	53883
 in text: {/TR> }
```
It looks as if the `start_byte`/`end_byte` values are shifted. The same happens with em dashes (U+2014), arrows (←), and similar non-ASCII characters. Is there a corrected version available, or is there a list of characters which have been replaced by a sequence of characters before calculating `start_byte`/`end_byte`?
I'm having the same issue; does anyone have any advice?
For posterity: you need to encode the string into bytes. In the case of @jheinecke's original code, `j["document_html"]` should become `j["document_html"].encode("utf-8")`.
To be clear, you have to decode the JSON to a dict, extract `"document_html"`, and then re-encode it in UTF-8. If you try to skip the JSON part, e.g., by directly grabbing whatever follows `b'"document_html":"'` in the raw jsonl, then you would have to handle the JSON string escapes yourself (so, don't do that).
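To see the drift concretely, here is a small illustration (a made-up string, not from the dataset) of how a single multi-byte character pushes character indices and byte offsets apart:

```python
s = "season\u00a08"            # "season", a non-breaking space, then "8"
b = s.encode("utf-8")
print(len(s), len(b))          # 8 characters, 9 bytes (U+00A0 is two bytes)
print(s[7], b[8:9])            # '8' b'8' -- the byte offset is one ahead
```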
Another thing to keep in mind is that some context-dependent canonicalization happens going from bytes to tokens, so the byte span does not always reproduce the token text exactly (note `</Th>` in `document_tokens` versus `</TH>` in the HTML output above).
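So even with correct byte slicing, an exact string comparison against the token can fail for such tokens. A quick way to spot them, reusing `j` and `html_bytes` from the sketch above:

```python
# Tokens whose byte span differs from the token text, e.g. </TH> vs </Th>.
mismatches = [
    t["token"]
    for t in j["document_tokens"]
    if html_bytes[t["start_byte"]:t["end_byte"]].decode("utf-8") != t["token"]
]
print(len(mismatches), mismatches[:10])
```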