justinwilaby/sax-wasm

lower case doctype is not recognised

samuelcolvin opened this issue · 11 comments

(Thanks so much for the library, it's awesome. I'm particularly appreciative that it's very flexible about the input, except in this case...)

Describe the bug

Lower case "doctype" is not supported, although it's valid in HTML.

To Reproduce

Any input with doctype as lowercase not upper case:

const fs = require('fs')
const { SaxEventType, SAXParser } = require('sax-wasm')

const saxWasmBuffer = fs.readFileSync(require.resolve('sax-wasm/lib/sax-wasm.wasm'))

const get_type = (t) => {
  for (const [name, value] of Object.entries(SaxEventType)) {
    if (t === value) {
      return name
    }
  }
}
const str2array = (str) => new Uint8Array(Buffer.from(str, 'utf8'))

// Instantiate
const options = {highWaterMark: 32 * 1024}
const parser = new SAXParser(SaxEventType.Attribute | SaxEventType.CloseTag | SaxEventType.OpenTag | SaxEventType.Text | SaxEventType.Doctype, options)
parser.eventHandler = (event, data) => {
  console.log(`${get_type(event)}:`, data.toJSON())
}

parser.prepareWasm(saxWasmBuffer).then(() => {
  console.log('lower case:')
  parser.write(str2array('<!doctype html>'))
  parser.end()
  console.log('upper case:')
  parser.write(str2array('<!DOCTYPE html>'))
  parser.end()
})

This prints nothing for the first case but does find the DOCTYPE clause for the second case.

I think the problem might be:

"DOCTYPE" => {

which appears to require "DOCTYPE"

Reference for doctype being case-insensitive: https://html.spec.whatwg.org/multipage/syntax.html#the-doctype

@samuelcolvin -

Thank you for the bug submission!

This is a particularly challenging issue that I knew about going in. Since WebAssembly is always fully self-contained with zero built-ins, writing a case insensitive match would mean including the entire utf-8 character library in the binary. I started here when writing this util but noticed the binary was just under 2mb in size so I had to make a decision about case sensitivity. Adding to_lower_case() or to_upper_case() adds a surprising amount of overhead in WebAssembly.

Adding the lower case doctype to the match makes sense but this would omit docType, DocType and all permutations of case which means this library would still not be up to spec.

For your use case, would it be acceptable to have doctype and DOCTYPE and skip the case normalization?

Couldn't agree more about size, see this discussion I started about wasm on cloudflare workers.

Yes, for me allowing lower case would be sufficient.

I've been thinking about this. I get that to_lower_case() etc. are heavy because they have to deal with all unicode cases, but can't you do some simple ascii case-insensitive string comparisons without all that?

Something like this:

static DOCSTRING: &str = "docstring";

fn ascii_icompare(expected: &str, test: &str) -> bool {
    if expected.len() != test.len() {
        return false;
    }
    for (e, t) in expected.chars().zip(test.chars()) {
        let char_diff = (e as i8) - (t as i8);
        if !(char_diff == 0 || char_diff == 32) {
            return false;
        }
    }
    true
}

fn main() {
    let result: bool = ascii_icompare(DOCSTRING, "DocString");
    println!("result: {:?}", result);
}

My rust is a bit, well, rusty, but surely something like that is possible without know anymore about utf-8 than that caps and non-caps are 32 away from each other?

Brilliant! This is a good idea. I'd just need to adjust your code to use absolute values or test for -32 and I think it will work.
Thank you for the suggestion.

Another option I was toying with is to just report all SGML decls as-is with no distinctions between doctype, comment or cdata and "other". Reporting all declarations in a single event would allow the client to grab whatever is there as-needed and the wasm would no longer need to track which type it is, reducing the complexity. Thoughts?

On the subject of performance, the thread you referenced was an interesting read and certainly coincides with what I observed working with webAssembly as well. Cold starts are testy. This is especially the case when stuffing big chunks of data in memory. For example, in my sym-spell port, pulling this block from where it is and waiting to load the dictionary in a separate operation, slows down startup by 30x even though the code is the same in every way, except in a separate function. Keeping it where it is significantly speeds up startups.

Anyway, I would love to increment the number of contributors to sax-wasm and your solution looks really good. Would you be willing to install your solution and raise a PR for this?

I'd just need to adjust your code to use absolute values or test for -32 and I think it will work.

I don't think so, the expected (perhaps better name required) should always be lower case. Let's say it has a character a (97), this should allow a (97, 97-97=0) or A (65, 97-65=32), but not \x81 (129). E.g. I don't think it needs an abs().

Reporting all declarations in a single event would allow the client to grab whatever is there as-needed and the wasm would no longer need to track which type it is, reducing the complexity.

Personally I'd like to be able to ignore comments completely, so I'd want be able to ignore them with SaxEventType argument for performance.

Creating a PR

Of course I'd like to help but I 55 PRs to review on pydantic alone, I doubt I'll have the capacity to help here in the next few weeks. Sorry to be useless.

Understood, I should be able to get to this shortly.

Thank you.

resolved by #44

v2.0.1 published! Thanks @samuelcolvin !

Thanks so much for resolving this, I've upgraded and tests are passing!