rubys/nokogumbo

Incorporate gumbo fixes from lua-gumbo

stevecheckoway opened this issue · 21 comments

lua-gumbo has made a bunch of improvements to gumbo. We should incorporate their fixes!

Note: the following pairs are probably best merged into single commits:

These should be easy to cherry-pick without requiring any earlier commits:

The rest probably need to be done bottom to top. I threw out the original Python generator scripts and autotools build system, so a few commits might also need adjusting to account for that too.

rubys commented

Any chance of a shared common codebase?

To be honest, probably not. After seeing some of the conflicting feature requests and ABI stability issues the original author had to deal with, I'm glad to be maintaining just an internal fork and not a shared library.

FWIW, there's really not that much substance in the above commits. The larger ones are purely for performance. The rest are mostly trivial. IMO the library was in pretty good shape by the time it was abandoned. The only part noticeably buggier than the rest seems to be error.c, but I'm not using that in lua-gumbo.

@craigbarnes If you ever decide to use the error portion, I had a few fixes

  • c685435 Handle zero bytes in the input without crashing and add missing parse error.
  • 60fa747 Improve the error API slightly by adding explicit lengths.

The error API could be improved further (and more error messages could be implemented), but it's pretty low priority.

rubys commented

@craigbarnes I probably shouldn't have used the words shared library. I didn't mean that in the runtime sense. I meant shared source.

Let me ask the question a different way. Are you willing to accept patches to the gumbo parser even if the fixes are to portions of the code that you aren't currently using (e.g., the error API)?

Previously, nokogumbo got access to the source code of gumbo-parser as a git submodule. Is there any reason that nokogumbo couldn't have lua-gumbo as a git submodule, even if the only portion of the code it accesses from that tree is the lib subdirectory?

Upside to lua-gumbo: you may get more fixes.
Downside to lua-gumo: some of those fixes may be to code that you don't use, or conceivably could even break lua-gumbo.

Upside to nokogumbo: picking up fixes from lua-gumbo would be as simple as updating the submodule.
Downside to nokogumbo: lua-gumbo may not accept fixes, or fixes could conceivably break nokogumbo.

Alternatives:

  1. we create a joint fork of gumbo-parser and shed the parts we don't need and both projects import that codebase.
  2. we put lua-gumbo and nokogumbo into a common repository and the build and test process produces and verifies two outputs. There would be no runtime dependency Ruby by the Lua output, and conversely no runtime dependency on Lua by the Ruby output.

Let me ask the question a different way. Are you willing to accept patches to the gumbo parser even if the fixes are to portions of the code that you aren't currently using (e.g., the error API)?

Yes. I intend to eventually clean up and use the error API anyway. I don't foresee removing anything from the library, besides the already removed custom allocator support.

Previously, nokogumbo got access to the source code of gumbo-parser as a git submodule. Is there any reason that nokogumbo couldn't have lua-gumbo as a git submodule

There's no reason you couldn't. I'm fine with people using the fork via whatever method they prefer, so long as they're ok with having no guarantees of API stability.

Downside to lua-gumo: some of those fixes may be to code that you don't use, or conceivably could even break lua-gumbo.

The former is fine by me. I won't merge anything that will break the Lua bindings though.

Downside to nokogumbo: lua-gumbo may not accept fixes, or fixes could conceivably break nokogumbo.

I'll accept anything that improves the correctness of the existing code. If I have to change the API, I'll do so as minimally as possible and make sure it doesn't go unnoticed.

  1. we create a joint fork of gumbo-parser and shed the parts we don't need and both projects import that codebase.

That would just invite the question "why not a shared library"? The only thing I'm adamantly against is maintaining a shared library.

  1. we put lua-gumbo and nokogumbo into a common repository and the build and test process produces and verifies two outputs.

It seems like that would just clutter the git history of both projects. Depending on what happens down the line, I may change my mind, but for now the option that sounds most agreeable to me is nokogumbo using lua-gumbo as a submodule.

By the way, if you opt for the submodule option, you'll probably want to handle the new GumboOutput::status field (specifically, when it's equal to GUMBO_STATUS_TREE_TOO_DEEP).

If you ever decide to use the error portion, I had a few fixes

c685435 Handle zero bytes in the input without crashing and add missing parse error.

@stevecheckoway Is this specifically triggered by calling gumbo_caret_diagnostic_to_string()? I'll pull your fix when I get some time to write a few test cases.

The error API could be improved further (and more error messages could be implemented), but it's pretty low priority.

Yeah, it needs a complete overhaul really. In the original libgumbo, it wasn't installed by "make install" and was considered a private/internal header. How much of the error API is exposed in nokogumbo?

@craigbarnes GUMBO_STATUS_OUT_OF_MEMORY seems unused. What's the rationale behind a 400-deep limit for GUMBO_STATUS_TREE_TOO_DEEP?

@stevecheckoway Is this specifically triggered by calling gumbo_caret_diagnostic_to_string()? I'll pull your fix when I get some time to write a few test cases.

Yes, the crash is due to calling gumbo_caret_diagnostic_to_string

How much of the API is exposed in nokogumbo?

None. The errors are exposed to the caller via an array of Nokogiri::XML::SyntaxErrors.

I spent way too long today

  1. writing some code to pull patches from lua-gumbo, change paths (from lib/blah to gumbo-parser/src/blah) and omit patches to files outside the lib;
  2. reseting nokogumbo's copy of gumbo-parser to the same version you started with;
  3. applying all of the patches in order; and
  4. re-implementing my patches on top of yours ace07d0.

I did this because I had trouble cherry-picking or otherwise merging the patches you mentioned. Did you clang-format the code or something?

It might be easier for you to grab ace07d0 than my previous commits.

GUMBO_STATUS_OUT_OF_MEMORY seems unused.

It is unused at the moment. I envisioned improving OOM handling in the future by using a status code instead of calling abort(), but that hasn't been implemented yet. The original Gumbo author also took this approach in his arena branch, but it was never finished.

What's the rationale behind a 400-deep limit for GUMBO_STATUS_TREE_TOO_DEEP?

  • Preventing google/gumbo-parser#393.
  • Preventing a recursive tree builder overflowing the stack.
  • Preventing possible DoS attack by "<div>".repeat(2^20)
  • No sane, non-malicious document has a tree depth of more than 150 or so.

The limit was originally 512, to match the limit used by WebKit. I reduced it to 400 to work around what seems to be a bug in OpenBSD causing excessive stack usage.

I'll add a field to GumboOptions to make the limit configurable. You could probably increase it if you want -- I'm pretty sure the bug is triggered by the Lua tree builder (gumbo/parse.c). The parser itself seems to handle tree depths of 50,000+ no problem.

I spent way too long today

  1. writing some code to pull patches from lua-gumbo, change paths (from lib/blah to gumbo-parser/src/blah) and omit patches to files outside the lib;
  2. reseting nokogumbo's copy of gumbo-parser to the same version you started with;
  3. applying all of the patches in order; and
  4. re-implementing my patches on top of yours ace07d0.

It seems to have worked pretty well anyway. It took me a lot longer to create the commits ;)

I did this because I had trouble cherry-picking or otherwise merging the patches you mentioned. Did you clang-format the code or something?

Actually I did the opposite. The Gumbo maintainer destroyed the formatting of the code with the abomination known as clang-format and I spent ages manually reformatting it.

It might be easier for you to grab ace07d0 than my previous commits.

Thanks. I'll take a look at that in a while.

This all sounds good to me.

What do you do when the limit is hit? Do you return a partial tree in lua? @rubys, do you have any thoughts about what nokogumbo should do in this case?

What do you do when the limit is hit? Do you return a partial tree in lua?

I just call gumbo_destroy_output() and return an error. I considered returning a partial tree with a boolean field marking it as such, but the possibility of people just ignoring it and treating it like a complete tree seemed too likely. If I were going to make it possible to get the partial tree, I'd probably add some kind of option to explicitly request it beforehand.

We're probably talking about a fraction of 1% edge case here though.

That seems sensible to me.

It might be easier for you to grab ace07d0 than my previous commits.

I couldn't figure out why your test cases wouldn't crash lua-gumbo, even when calling gumbo_caret_diagnostic_to_string() without your patch applied.

It seems I added a commit a while back that might have fixed the same issue. I'm not sure if it fully fixes the issue you were having or how much my patch overlaps with yours. I regret not writing a proper commit message now...

I don't think that commit could have an effect. The assertion that was triggered (are you perhaps compiling with NDEBUG?) was in find_last_newline as it searched backward for a '\n' and encountered a U+0000 byte.

Ah, I remember why I made that commit now. It was to prevent error->position.column - 1 wrapping around to SIZE_MAX if error->position.column is zero. So it's not related to your ace07d0 commit.

... are you perhaps compiling with NDEBUG

No. I haven't been using NDEBUG even for release builds.

I guess I can try writing a unit test. I haven't used gtest before, but it looks reasonably simple.

I haven't used gtest before, but it looks reasonably simple.

I just spent some time converting most of the C++ unit tests to C. I'm going to throw out the gtest/C++ dependency altogether after I convert the rest.

Edit: All the parser tests in the lua-gumbo repo are in C now.

@craigbarnes Why not incorporate SETUP() and TEARDOWN() into TEST_F rather than writing them on every test that needs them. Those that don't just define them to empty.

Something like this (untested)

#define TEST_F(group, name) \
static inline void group ## name ## _inner(void); \
static void CONSTRUCTOR group ## name(void) { \
  SETUP(); \
  group ## name ## _inner(); \
  TEARDOWN(); \
} \
void group ## name ## _inner(void)

I think this will also work with failing assertions which return without calling TEARDOWN() right now.

The ASSERT_ macros also have issues with side-effects. I think the canonical way to deal with this is to have each macro expand to a call to a function that passes __FILE__, __LINE__, and the assertion to deal with that.

Here's some code that will allegedly work with visual studio for constructors. https://stackoverflow.com/a/2390626

Why not incorporate SETUP() and TEARDOWN() into TEST_F rather than writing them on every test that needs them

I thought about that, but I wanted to just convert everything first before getting too complex with the macros. I like your idea though, but it doesn't seem to work as-is. If it's possible to do something like that without too much extra contortion then I'm all for it.

The ASSERT_ macros also have issues with side-effects. I think the canonical way to deal with this is to have each macro expand to a call to a function that passes __FILE__, __LINE__, and the assertion to deal with that.

I'm working on this at the moment.

Here's some code that will allegedly work with visual studio for constructors

I came across that thread too. I didn't attempt to incorporate it because I don't have any Windows machines to test with, except for the Appveyor CI setup.

Addressed by #84!