Bug: `'` is incorrectly treated as delimiter inside a double quoted attribute
Closed this issue ยท 8 comments
/([\w:.-]+)\s*=\s*["'](.*?)["']/g
used in decrypting attributes will incorrectly deal with the attributes containing '
character as delimiter.
Test cases:
<element attribute='This is a "double quote" inside single quotes'></element>
<element attribute="This is a 'single quote' inside double quotes"></element>
<element attribute='This is an apostrophe: ''></element>
<element attribute="This is a "double quote" inside double quotes"></element>
Thanks I'll investigate and fix.
I think the fixed pattern is /([\w:.-]+)\s*=\s*(["'])(.*?)\2/g
I'm capturing the first ["']
and replacing the second with \2
to ensure the quotes match.
Will test further a push a fix tomorrow.
It doesn't seem to work on my end. I'll prepare a quick Deno test
This is the test I'm using:
const data = `<root>
<element attribute='This is a "double quote" inside single quotes'></element>
<element attribute="This is a 'single quote' inside double quotes"></element>
<element attribute='This is an apostrophe: ''></element>
<element attribute="This is a "double quote" inside double quotes"></element>
</root>
`;
const parser = parse('data:application/xml,' + encodeURIComponent(data));
for await (const node of parser) {
console.log(node.type, node.attributes);
}
With the current version 0.3.0
I get the bug issue:
element { attribute: "This is a " }
element { attribute: "This is a " }
element { attribute: "This is an apostrophe: '" }
element { attribute: "This is a "double quote" inside double quotes" }
root {}
With the fixed pattern I get:
element { attribute: 'This is a "double quote" inside single quotes' }
element { attribute: "This is a 'single quote' inside double quotes" }
element { attribute: "This is an apostrophe: '" }
element { attribute: "This is a "double quote" inside double quotes" }
root {}
Something is missing though.
Try this one please:
import { assertEquals } from "https://deno.land/std@0.219.0/assert/mod.ts";
import { existsSync } from "https://deno.land/std@0.219.0/fs/exists.ts";
import { parse } from "./parse.ts";
const testFilePath = async () => {
const path = await Deno.realPath("./english-wordnet-2023.xml");
return path
}
const testFileExists = async () => {
if (existsSync("./english-wordnet-2023.xml")) {
const path = await Deno.realPath("./english-wordnet-2023.xml");
const stat = await Deno.stat(path)
return stat.isFile
}
return false
}
const fetchTestFile = async () => {
const src = await fetch("https://en-word.net/static/english-wordnet-2023.xml.gz");
const dest = await Deno.open("./english-wordnet-2023.xml", {
create: true,
write: true,
});
if (src.body == null) return;
await src.body
.pipeThrough(new DecompressionStream("gzip"))
.pipeTo(dest.writable);
}
Deno.test("quotes", async () => {
if (!await testFileExists()) {
console.log('unzipping')
await fetchTestFile()
}
const p = await testFilePath()
const parser = parse(`file:///${p.replace('\\', '/')}`, {
ignoreDeclaration: false,
silent: false
});
const expect = [
{ lemma: "tailor", count: 2 }, // "v" and "n"
{ lemma: "guard", count: 2 }, // "v" and "n"
{ lemma: "tailor's tack", count: 1 },
{ lemma: "Aladdin", count: 1 },
{ lemma: "Aladdin's lamp", count: 1 },
]
const found: Map<string, number> = new Map()
let count = 0
for await (const node of parser) {
if (node.type == "Lemma") {
const writtenForm = node.attributes["writtenForm"]
expect.forEach((v) => {
if (writtenForm == v.lemma) {
found.set(v.lemma, (found.get(v.lemma) || 0) + 1)
console.log(node.raw)
}
})
count++
}
}
for (const e of expect) {
const f = found.get(e.lemma) || 0
assertEquals(f, e.count, `should be ${e.count} of "${e.lemma}" lemmas, found ${f} instead`)
}
console.log(`${count} lemmas processed`)
});
With your current suggestion it fails with not founding the lemmas.
But if we simply remove the '
from the regex (as /([\w:.-]+)\s*=\s*["](.*?)["]/g
, since the source file contains only "
as delimiters it's not bumping into an apostrophe situation), it finds appropriate nodes with correct amounts.
Thanks for the thorough test.
Sorry I failed to mention another change in node.ts line 47:
diff --git a/src/node.ts b/src/node.ts
index c26ec47..9c813d7 100644
--- a/src/node.ts
+++ b/src/node.ts
@@ -41,10 +41,10 @@ export class Node {
// Setup and parse attributes on first access
this.#attr = {};
if (this.raw) {
- const regex = /([\w:.-]+)\s*=\s*["'](.*?)["']/g;
+ const regex = /([\w:.-]+)\s*=\s*(["'])(.*?)\2/g;
let match: RegExpExecArray | null;
while ((match = regex.exec(this.raw)) !== null) {
- this.#attr[match[1]] = match[2];
+ this.#attr[match[1]] = match[3];
}
}
return this.#attr;
Because it's doing an extra capture we need to bump 2
to 3
for the attribute value.
With that fix my test passes:
------- output -------
<Lemma writtenForm="guard" partOfSpeech="n">
<Lemma writtenForm="tailor's tack" partOfSpeech="n" />
<Lemma writtenForm="Aladdin's lamp" partOfSpeech="n" />
<Lemma writtenForm="tailor" partOfSpeech="n">
<Lemma writtenForm="Aladdin" partOfSpeech="n" />
<Lemma writtenForm="guard" partOfSpeech="v">
<Lemma writtenForm="tailor" partOfSpeech="v">
161338 lemmas processed
----- output end -----
quotes ... ok (4s)
ok | 1 passed | 0 failed (4s)
Ah! That was a missing piece!
This is fixed and published in v0.4 thanks for the help.