Incorporate gumbo fixes from lua-gumbo
stevecheckoway opened this issue · 21 comments
lua-gumbo has made a bunch of improvements to gumbo. We should incorporate their fixes!
- craigbarnes/lua-gumbo@91cef89: Re-implement
adjust_foreign_attributes()
with a gperf hash - craigbarnes/lua-gumbo@b11abe7: Pass
TagSet
arrays into functions by reference instead of value - craigbarnes/lua-gumbo@b73dc03: Simplify
maybe_replace_codepoint()
function - craigbarnes/lua-gumbo@d5d0bb3: Remove special handling of
<menuitem>
tag - craigbarnes/lua-gumbo@7bd5162: Remove special handling of
<isindex>
tag - craigbarnes/lua-gumbo@a5c1b0e: Use
realloc(3)
instead ofmalloc(3)
inenlarge_vector_if_full()
- craigbarnes/lua-gumbo@dcbebd7: Use
realloc(3)
instead ofmalloc(3)
inmaybe_resize_string_buffer()
- craigbarnes/lua-gumbo@df15262: Make
destroy_node()
function non-recursive - craigbarnes/lua-gumbo@2df37f5: Fix signedness of some format specifiers
- craigbarnes/lua-gumbo@176553e: Add maximum element nesting limit
- craigbarnes/lua-gumbo@bed0f4a: Annotate
gumbo_debug()
withPRINTF
macro and fix warnings - craigbarnes/lua-gumbo@7ffc218: Annotate
print_message()
withPRINTF
macro and fix warnings - craigbarnes/lua-gumbo@1bd8ab5, craigbarnes/lua-gumbo@9136507, craigbarnes/lua-gumbo@53a1f9a: Deduplicate some identical
TagSet
arrays - craigbarnes/lua-gumbo@a7a9065: Add some GCC/Clang function attributes
- craigbarnes/lua-gumbo@8d3d4e4: Remove custom allocator support
- craigbarnes/lua-gumbo@8d3b006: Fix recording of source positions for
</form>
end tags - craigbarnes/lua-gumbo@1a8d763: Replace linear search in
maybe_replace_codepoint()
with a lookup table - craigbarnes/lua-gumbo@6dca79e: Replace
strcasecmp()
andstrncasecmp()
with ascii-only equivalents - craigbarnes/lua-gumbo@17ab1d2: Fix
TAGSET_INCLUDES
macro to work properly with multiple bit flags - craigbarnes/lua-gumbo@7e56d45: Re-implement
gumbo_normalize_svg_tagname()
with a gperf hash - craigbarnes/lua-gumbo@a518d35: Replace linear array search in
adjust_svg_attributes()
with a gperf hash - craigbarnes/lua-gumbo@a4a7433: Fix duplicate
TagSet
initializer being ignored inis_special_node()
- craigbarnes/lua-gumbo@8137fcd: Add support for
<dialog>
tag - craigbarnes/lua-gumbo@4b35471: Add missing
static
qualifiers to hide symbols that shouldn't be extern - craigbarnes/lua-gumbo@df57c59, craigbarnes/lua-gumbo@03101f3, craigbarnes/lua-gumbo@ea62330: Replace use of locale-dependant
ctype.h
functions with custom, ASCII-only equivalents
Note: the following pairs are probably best merged into single commits:
- craigbarnes/lua-gumbo@1a8d763 and craigbarnes/lua-gumbo@b73dc03
- craigbarnes/lua-gumbo@a4a7433 and craigbarnes/lua-gumbo@17ab1d2
These should be easy to cherry-pick without requiring any earlier commits:
- craigbarnes/lua-gumbo@df15262
- craigbarnes/lua-gumbo@7bd5162
- craigbarnes/lua-gumbo@d5d0bb3
- craigbarnes/lua-gumbo@8137fcd
The rest probably need to be done bottom to top. I threw out the original Python generator scripts and autotools build system, so a few commits might also need adjusting to account for that too.
Any chance of a shared common codebase?
To be honest, probably not. After seeing some of the conflicting feature requests and ABI stability issues the original author had to deal with, I'm glad to be maintaining just an internal fork and not a shared library.
FWIW, there's really not that much substance in the above commits. The larger ones are purely for performance. The rest are mostly trivial. IMO the library was in pretty good shape by the time it was abandoned. The only part noticeably buggier than the rest seems to be error.c
, but I'm not using that in lua-gumbo.
@craigbarnes If you ever decide to use the error portion, I had a few fixes
- c685435 Handle zero bytes in the input without crashing and add missing parse error.
- 60fa747 Improve the error API slightly by adding explicit lengths.
The error API could be improved further (and more error messages could be implemented), but it's pretty low priority.
@craigbarnes I probably shouldn't have used the words shared library. I didn't mean that in the runtime sense. I meant shared source.
Let me ask the question a different way. Are you willing to accept patches to the gumbo parser even if the fixes are to portions of the code that you aren't currently using (e.g., the error API)?
Previously, nokogumbo got access to the source code of gumbo-parser as a git submodule. Is there any reason that nokogumbo couldn't have lua-gumbo as a git submodule, even if the only portion of the code it accesses from that tree is the lib subdirectory?
Upside to lua-gumbo: you may get more fixes.
Downside to lua-gumo: some of those fixes may be to code that you don't use, or conceivably could even break lua-gumbo.
Upside to nokogumbo: picking up fixes from lua-gumbo would be as simple as updating the submodule.
Downside to nokogumbo: lua-gumbo may not accept fixes, or fixes could conceivably break nokogumbo.
Alternatives:
- we create a joint fork of gumbo-parser and shed the parts we don't need and both projects import that codebase.
- we put lua-gumbo and nokogumbo into a common repository and the build and test process produces and verifies two outputs. There would be no runtime dependency Ruby by the Lua output, and conversely no runtime dependency on Lua by the Ruby output.
Let me ask the question a different way. Are you willing to accept patches to the gumbo parser even if the fixes are to portions of the code that you aren't currently using (e.g., the error API)?
Yes. I intend to eventually clean up and use the error API anyway. I don't foresee removing anything from the library, besides the already removed custom allocator support.
Previously, nokogumbo got access to the source code of gumbo-parser as a git submodule. Is there any reason that nokogumbo couldn't have lua-gumbo as a git submodule
There's no reason you couldn't. I'm fine with people using the fork via whatever method they prefer, so long as they're ok with having no guarantees of API stability.
Downside to lua-gumo: some of those fixes may be to code that you don't use, or conceivably could even break lua-gumbo.
The former is fine by me. I won't merge anything that will break the Lua bindings though.
Downside to nokogumbo: lua-gumbo may not accept fixes, or fixes could conceivably break nokogumbo.
I'll accept anything that improves the correctness of the existing code. If I have to change the API, I'll do so as minimally as possible and make sure it doesn't go unnoticed.
- we create a joint fork of gumbo-parser and shed the parts we don't need and both projects import that codebase.
That would just invite the question "why not a shared library"? The only thing I'm adamantly against is maintaining a shared library.
- we put lua-gumbo and nokogumbo into a common repository and the build and test process produces and verifies two outputs.
It seems like that would just clutter the git history of both projects. Depending on what happens down the line, I may change my mind, but for now the option that sounds most agreeable to me is nokogumbo using lua-gumbo as a submodule.
By the way, if you opt for the submodule option, you'll probably want to handle the new GumboOutput::status
field (specifically, when it's equal to GUMBO_STATUS_TREE_TOO_DEEP
).
If you ever decide to use the error portion, I had a few fixes
c685435
Handle zero bytes in the input without crashing and add missing parse error.
@stevecheckoway Is this specifically triggered by calling gumbo_caret_diagnostic_to_string()
? I'll pull your fix when I get some time to write a few test cases.
The error API could be improved further (and more error messages could be implemented), but it's pretty low priority.
Yeah, it needs a complete overhaul really. In the original libgumbo, it wasn't installed by "make install" and was considered a private/internal header. How much of the error API is exposed in nokogumbo?
@craigbarnes GUMBO_STATUS_OUT_OF_MEMORY
seems unused. What's the rationale behind a 400-deep limit for GUMBO_STATUS_TREE_TOO_DEEP
?
@stevecheckoway Is this specifically triggered by calling gumbo_caret_diagnostic_to_string()? I'll pull your fix when I get some time to write a few test cases.
Yes, the crash is due to calling gumbo_caret_diagnostic_to_string
How much of the API is exposed in nokogumbo?
None. The errors are exposed to the caller via an array of Nokogiri::XML::SyntaxError
s.
I spent way too long today
- writing some code to pull patches from lua-gumbo, change paths (from
lib/blah
togumbo-parser/src/blah
) and omit patches to files outside thelib
; - reseting nokogumbo's copy of gumbo-parser to the same version you started with;
- applying all of the patches in order; and
- re-implementing my patches on top of yours ace07d0.
I did this because I had trouble cherry-picking or otherwise merging the patches you mentioned. Did you clang-format the code or something?
It might be easier for you to grab ace07d0 than my previous commits.
GUMBO_STATUS_OUT_OF_MEMORY
seems unused.
It is unused at the moment. I envisioned improving OOM handling in the future by using a status code instead of calling abort()
, but that hasn't been implemented yet. The original Gumbo author also took this approach in his arena
branch, but it was never finished.
What's the rationale behind a 400-deep limit for GUMBO_STATUS_TREE_TOO_DEEP?
- Preventing google/gumbo-parser#393.
- Preventing a recursive tree builder overflowing the stack.
- Preventing possible DoS attack by
"<div>".repeat(2^20)
- No sane, non-malicious document has a tree depth of more than 150 or so.
The limit was originally 512, to match the limit used by WebKit. I reduced it to 400 to work around what seems to be a bug in OpenBSD causing excessive stack usage.
I'll add a field to GumboOptions
to make the limit configurable. You could probably increase it if you want -- I'm pretty sure the bug is triggered by the Lua tree builder (gumbo/parse.c
). The parser itself seems to handle tree depths of 50,000+ no problem.
I spent way too long today
- writing some code to pull patches from lua-gumbo, change paths (from lib/blah to gumbo-parser/src/blah) and omit patches to files outside the lib;
- reseting nokogumbo's copy of gumbo-parser to the same version you started with;
- applying all of the patches in order; and
- re-implementing my patches on top of yours ace07d0.
It seems to have worked pretty well anyway. It took me a lot longer to create the commits ;)
I did this because I had trouble cherry-picking or otherwise merging the patches you mentioned. Did you clang-format the code or something?
Actually I did the opposite. The Gumbo maintainer destroyed the formatting of the code with the abomination known as clang-format
and I spent ages manually reformatting it.
It might be easier for you to grab ace07d0 than my previous commits.
Thanks. I'll take a look at that in a while.
This all sounds good to me.
What do you do when the limit is hit? Do you return a partial tree in lua? @rubys, do you have any thoughts about what nokogumbo should do in this case?
What do you do when the limit is hit? Do you return a partial tree in lua?
I just call gumbo_destroy_output()
and return an error. I considered returning a partial tree with a boolean field marking it as such, but the possibility of people just ignoring it and treating it like a complete tree seemed too likely. If I were going to make it possible to get the partial tree, I'd probably add some kind of option to explicitly request it beforehand.
We're probably talking about a fraction of 1% edge case here though.
That seems sensible to me.
It might be easier for you to grab ace07d0 than my previous commits.
I couldn't figure out why your test cases wouldn't crash lua-gumbo, even when calling gumbo_caret_diagnostic_to_string()
without your patch applied.
It seems I added a commit a while back that might have fixed the same issue. I'm not sure if it fully fixes the issue you were having or how much my patch overlaps with yours. I regret not writing a proper commit message now...
I don't think that commit could have an effect. The assertion that was triggered (are you perhaps compiling with NDEBUG?) was in find_last_newline
as it searched backward for a '\n'
and encountered a U+0000 byte.
Ah, I remember why I made that commit now. It was to prevent error->position.column - 1
wrapping around to SIZE_MAX
if error->position.column
is zero. So it's not related to your ace07d0 commit.
... are you perhaps compiling with
NDEBUG
No. I haven't been using NDEBUG
even for release builds.
I guess I can try writing a unit test. I haven't used gtest before, but it looks reasonably simple.
I haven't used gtest before, but it looks reasonably simple.
I just spent some time converting most of the C++ unit tests to C. I'm going to throw out the gtest/C++ dependency altogether after I convert the rest.
Edit: All the parser tests in the lua-gumbo repo are in C now.
@craigbarnes Why not incorporate SETUP()
and TEARDOWN()
into TEST_F
rather than writing them on every test that needs them. Those that don't just define them to empty.
Something like this (untested)
#define TEST_F(group, name) \
static inline void group ## name ## _inner(void); \
static void CONSTRUCTOR group ## name(void) { \
SETUP(); \
group ## name ## _inner(); \
TEARDOWN(); \
} \
void group ## name ## _inner(void)
I think this will also work with failing assertions which return without calling TEARDOWN()
right now.
The ASSERT_
macros also have issues with side-effects. I think the canonical way to deal with this is to have each macro expand to a call to a function that passes __FILE__
, __LINE__
, and the assertion to deal with that.
Here's some code that will allegedly work with visual studio for constructors. https://stackoverflow.com/a/2390626
Why not incorporate
SETUP()
andTEARDOWN()
intoTEST_F
rather than writing them on every test that needs them
I thought about that, but I wanted to just convert everything first before getting too complex with the macros. I like your idea though, but it doesn't seem to work as-is. If it's possible to do something like that without too much extra contortion then I'm all for it.
The
ASSERT_
macros also have issues with side-effects. I think the canonical way to deal with this is to have each macro expand to a call to a function that passes__FILE__
,__LINE__
, and the assertion to deal with that.
I'm working on this at the moment.
Here's some code that will allegedly work with visual studio for constructors
I came across that thread too. I didn't attempt to incorporate it because I don't have any Windows machines to test with, except for the Appveyor CI setup.
Addressed by #84!