dbushell/xml-streamify

Bug: `'` is incorrectly treated as delimiter inside a double quoted attribute

Closed this issue ยท 8 comments

/([\w:.-]+)\s*=\s*["'](.*?)["']/g used in decrypting attributes will incorrectly deal with the attributes containing ' character as delimiter.

Test cases:
<element attribute='This is a "double quote" inside single quotes'></element>

<element attribute="This is a 'single quote' inside double quotes"></element>

<element attribute='This is an apostrophe: &apos;'></element>

<element attribute="This is a &quot;double quote&quot; inside double quotes"></element>

Thanks I'll investigate and fix.

I think the fixed pattern is /([\w:.-]+)\s*=\s*(["'])(.*?)\2/g

I'm capturing the first ["'] and replacing the second with \2 to ensure the quotes match.

Will test further a push a fix tomorrow.

It doesn't seem to work on my end. I'll prepare a quick Deno test

This is the test I'm using:

const data = `<root>
<element attribute='This is a "double quote" inside single quotes'></element>
<element attribute="This is a 'single quote' inside double quotes"></element>
<element attribute='This is an apostrophe: &apos;'></element>
<element attribute="This is a &quot;double quote&quot; inside double quotes"></element>
</root>
`;

const parser = parse('data:application/xml,' + encodeURIComponent(data));
for await (const node of parser) {
  console.log(node.type, node.attributes);
}

With the current version 0.3.0 I get the bug issue:

element { attribute: "This is a " }
element { attribute: "This is a " }
element { attribute: "This is an apostrophe: &apos;" }
element { attribute: "This is a &quot;double quote&quot; inside double quotes" }
root {}

With the fixed pattern I get:

element { attribute: 'This is a "double quote" inside single quotes' }
element { attribute: "This is a 'single quote' inside double quotes" }
element { attribute: "This is an apostrophe: &apos;" }
element { attribute: "This is a &quot;double quote&quot; inside double quotes" }
root {}

Something is missing though.

Try this one please:

import { assertEquals } from "https://deno.land/std@0.219.0/assert/mod.ts";
import { existsSync } from "https://deno.land/std@0.219.0/fs/exists.ts";
import { parse } from "./parse.ts";

const testFilePath = async () => {
  const path = await Deno.realPath("./english-wordnet-2023.xml");
  return path
}

const testFileExists = async () => {
  if (existsSync("./english-wordnet-2023.xml")) {
    const path = await Deno.realPath("./english-wordnet-2023.xml");
    const stat = await Deno.stat(path)
    return stat.isFile
  }
  return false
}

const fetchTestFile = async () => {
  const src = await fetch("https://en-word.net/static/english-wordnet-2023.xml.gz");
  const dest = await Deno.open("./english-wordnet-2023.xml", {
    create: true,
    write: true,
  });
  if (src.body == null) return;
  await src.body
    .pipeThrough(new DecompressionStream("gzip"))
    .pipeTo(dest.writable);
}

Deno.test("quotes", async () => {
  if (!await testFileExists()) {
    console.log('unzipping')
    await fetchTestFile()
  }

  const p = await testFilePath()

  const parser = parse(`file:///${p.replace('\\', '/')}`, {
    ignoreDeclaration: false,
    silent: false
  });

  const expect = [
    { lemma: "tailor", count: 2 }, // "v" and "n"
    { lemma: "guard", count: 2 }, // "v" and "n"
    { lemma: "tailor's tack", count: 1 },
    { lemma: "Aladdin", count: 1 },
    { lemma: "Aladdin's lamp", count: 1 },
  ]
  const found: Map<string, number> = new Map()

  let count = 0

  for await (const node of parser) {
    if (node.type == "Lemma") {
      const writtenForm = node.attributes["writtenForm"]

      expect.forEach((v) => {
        if (writtenForm == v.lemma) {
          found.set(v.lemma, (found.get(v.lemma) || 0) + 1)
          console.log(node.raw)
        }
      })
      count++
    }
  }

  for (const e of expect) {
    const f = found.get(e.lemma) || 0
    assertEquals(f, e.count, `should be ${e.count} of "${e.lemma}" lemmas, found ${f} instead`)
  }
  console.log(`${count} lemmas processed`)
});

With your current suggestion it fails with not founding the lemmas.

But if we simply remove the ' from the regex (as /([\w:.-]+)\s*=\s*["](.*?)["]/g, since the source file contains only " as delimiters it's not bumping into an apostrophe situation), it finds appropriate nodes with correct amounts.

Thanks for the thorough test.

Sorry I failed to mention another change in node.ts line 47:

diff --git a/src/node.ts b/src/node.ts
index c26ec47..9c813d7 100644
--- a/src/node.ts
+++ b/src/node.ts
@@ -41,10 +41,10 @@ export class Node {
     // Setup and parse attributes on first access
     this.#attr = {};
     if (this.raw) {
-      const regex = /([\w:.-]+)\s*=\s*["'](.*?)["']/g;
+      const regex = /([\w:.-]+)\s*=\s*(["'])(.*?)\2/g;
       let match: RegExpExecArray | null;
       while ((match = regex.exec(this.raw)) !== null) {
-        this.#attr[match[1]] = match[2];
+        this.#attr[match[1]] = match[3];
       }
     }
     return this.#attr;

Because it's doing an extra capture we need to bump 2 to 3 for the attribute value.

With that fix my test passes:

------- output -------
<Lemma writtenForm="guard" partOfSpeech="n">
<Lemma writtenForm="tailor's tack" partOfSpeech="n" />
<Lemma writtenForm="Aladdin's lamp" partOfSpeech="n" />
<Lemma writtenForm="tailor" partOfSpeech="n">
<Lemma writtenForm="Aladdin" partOfSpeech="n" />
<Lemma writtenForm="guard" partOfSpeech="v">
<Lemma writtenForm="tailor" partOfSpeech="v">
161338 lemmas processed
----- output end -----
quotes ... ok (4s)

ok | 1 passed | 0 failed (4s)

Ah! That was a missing piece!

This is fixed and published in v0.4 thanks for the help.