ignore BOM

Question

hansifer opened this issue a year ago · 6 comments

I ran into an issue where the tokenizer is choking on files with a BOM. This throws with Error: Unexpected "ï" at position "0" in state START.

I was able to patch tokenizer with a quick and dirty addition of TokenizerStates.BOM. Unfortunately I don't have time to submit a formal PR but wanted to raise the issue for tracking.

Answer 1 · 2024-01-19T17:35:29.000Z

Hi @hansifer ,

Can you provide a sample input to test the issue?
Also, can you provide at least a code snippet of your workaround so I have a starting point?
I can then add tests and put the change in 🙂

Answer 2 · 2024-01-19T18:12:41.000Z

json-with-utf-8-bom.json

This is a simple JSON file with a UTF-8 BOM (initial byte sequence EF BB BF). The BOM displays differently when opened in Sublime Text or Notepad++ and not at all in VS Code, but they all identify this kind of file as "UTF-8 with BOM". I think BOMs are relatively rare in UTF-8 files, so this can reasonably be considered an edge case, but my software's old Export feature includes one, so I need to accommodate it in the Import.

As an aside, TextDecoder will by default skip over the BOM but gives an option not to. https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/ignoreBOM
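To illustrate the TextDecoder behavior, here is a minimal sketch. The byte array is a made-up example (a UTF-8 BOM followed by the JSON text {"a":1}), not the attached sample file.

```javascript
// UTF-8 BOM (EF BB BF) followed by the bytes of {"a":1}.
// These bytes are an illustrative example, not the attached sample file.
const bytes = new Uint8Array([
  0xef, 0xbb, 0xbf,
  0x7b, 0x22, 0x61, 0x22, 0x3a, 0x31, 0x7d,
]);

// Default (ignoreBOM: false): the decoder drops the leading BOM.
const stripped = new TextDecoder("utf-8").decode(bytes);

// ignoreBOM: true keeps U+FEFF as the first character of the decoded string.
const kept = new TextDecoder("utf-8", { ignoreBOM: true }).decode(bytes);

console.log(stripped.length, kept.length);
console.log(kept.charCodeAt(0) === 0xfeff);
```

Despite its name, ignoreBOM: true means "do not treat the BOM specially", so the character stays in the output.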

Anyway, as mentioned, the code changes were made in haste, so I'm sure I missed some things, but here's a summary of what I added (inspired by the TokenizerStates.SEPARATOR code):

added to Tokenizer constructor:

this.bom = [0xef, 0xbb, 0xbf];
this.bomIndex = 0;

added to Tokenizer write loop START case:

case TokenizerStates.START:
    this.offset += 1;
    if (n === this.bom[0]) {
        // console.log('found BOM')
        this.state = TokenizerStates.BOM;
        continue;
    }

added Tokenizer write loop BOM case:

case TokenizerStates.BOM:
    // console.log('processing BOM')
    this.bomIndex += 1;
    if (n !== this.bom[this.bomIndex]) {
        break;
    }
    if (this.bomIndex === this.bom.length - 1) {
        this.state = TokenizerStates.START;
        this.bomIndex = 0;
    }
    continue;
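
The same idea can be tried outside the library. Below is a minimal standalone sketch, assuming input arrives as Uint8Array chunks. createBomSkipper and UTF8_BOM are hypothetical names for illustration and are not part of the library's API.

```javascript
const UTF8_BOM = [0xef, 0xbb, 0xbf];

// Hypothetical helper (not part of the library): returns a function that
// removes one leading UTF-8 BOM from a sequence of Uint8Array chunks,
// even when the BOM is split across chunk boundaries.
function createBomSkipper() {
  let matched = 0;  // number of BOM bytes matched so far
  let done = false; // true once the start of the stream has been resolved

  return function skip(chunk) {
    if (done) return chunk;

    let i = 0;
    while (
      i < chunk.length &&
      matched < UTF8_BOM.length &&
      chunk[i] === UTF8_BOM[matched]
    ) {
      i += 1;
      matched += 1;
    }

    if (matched === UTF8_BOM.length) {
      // Full BOM seen: drop it and pass everything else through from now on.
      done = true;
      return chunk.subarray(i);
    }
    if (i === chunk.length) {
      // Chunk ended in the middle of a possible BOM: hold the bytes back.
      // If the stream ends here, the held bytes are never emitted.
      return chunk.subarray(i);
    }

    // Mismatch: the input did not start with a BOM, so restore held bytes.
    done = true;
    const out = new Uint8Array(matched + chunk.length - i);
    out.set(UTF8_BOM.slice(0, matched), 0);
    out.set(chunk.subarray(i), matched);
    return out;
  };
}
```

Only the first BOM is removed. A second BOM sequence right after it is passed through unchanged as content.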

Excellent library by the way! Thanks for creating it.

Answer 3 · 2024-01-20T16:38:58.000Z

Published as part of v0.0.20.

BOM is now supported for Uint8Array, Uint16Array and Uint32Array.

Answer 4 · 2024-01-20T19:04:42.000Z

Thanks for the quick fix!

Unfortunately I'm now running into this error using that same sample file I provided:

Error: Unexpected "ï" at position "3" in state START

I'm setting up the reader this way:

const parser = new JSONParser({
  stringBufferSize: undefined,
  keepStack: false,
})

const stream = file.stream().pipeThrough(parser)
const reader = stream.getReader()

Any suggestions?

Answer 5 · 2024-01-20T19:37:58.000Z

Hi @hansifer,

Sorry, I just reviewed your file and it includes the BOM sequence twice.

I implemented BOM support following the Unicode rules.

Q: What should I do with U+FEFF in the middle of a file?

In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF should normally not occur. For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the file or string. The use of U+2060 WORD JOINER is strongly preferred over ZWNBSP for expressing word joining semantics since it cannot be confused with a BOM. When designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In that case, any U+FEFF occurring in the middle of a file can be treated as an unsupported character. [AF]

So, I treat the BOM sequence as an unsupported character except for the first sequence at the beginning of the file.
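
A quick way to check a file for this situation is to count how many UTF-8 BOM sequences sit at its very start. The following is a minimal sketch. countLeadingUtf8Boms is a hypothetical helper name and is not part of the library.

```javascript
// Hypothetical diagnostic helper (not part of the library): counts how many
// consecutive UTF-8 BOM sequences (EF BB BF) appear at the start of the bytes.
function countLeadingUtf8Boms(bytes) {
  let count = 0;
  let i = 0;
  while (
    i + 2 < bytes.length &&
    bytes[i] === 0xef &&
    bytes[i + 1] === 0xbb &&
    bytes[i + 2] === 0xbf
  ) {
    count += 1;
    i += 3;
  }
  return count;
}

// Illustrative input: two BOMs followed by the bytes of {}.
const doubled = new Uint8Array([0xef, 0xbb, 0xbf, 0xef, 0xbb, 0xbf, 0x7b, 0x7d]);
console.log(countLeadingUtf8Boms(doubled));
```

A result greater than 1 means the first BOM is accepted and the next one, starting at byte offset 3, is treated as content, which would be consistent with the error being reported at position "3".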

Do you have any idea why your file might be "wrong"?

Answer 6 · 2024-01-20T20:11:32.000Z

Oh wow. You're right. There's an old bug in my code that was manually adding the BOM and so was FileSaver. 🤦‍♂️

Sorry I completely missed that.

Thanks for your quick responses and patience.