commoncrawl/ia-web-commons

WAT/WET generator performance improvements

sebastian-nagel opened this issue · 13 comments

Try to improve the performance of the WAT/WET generator. The results of a profiler run (on 3abab54, using async-profiler) show that most of the time is spent on

  • HTML parsing (by htmlparser.org), including reading, verifying, and decompressing the WARC data
  • JSON serialization
  • regex matching

[flame graph: weat_prof (interactive SVG.zip attached)]

The regex seems to be called from ExtractingParseObserver.handleTextNode(), and that class is responsible for a significant share of the time, spent in its handleTextNode, handleTagOpen, and handleTagClose methods.

I suspect that simply using a pre-compiled pattern in lines like this:

	String t = text.getText().replaceAll("\\s+", " ");

could have a significant impact on performance since it's used so much, but the whole class could probably use a review.
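A minimal sketch of the pre-compiled variant (the class and method names below are illustrative, not the actual ia-web-commons code):

    import java.util.regex.Pattern;

    public class PrecompiledWhitespace {
        // Compiling once avoids String.replaceAll() recompiling "\\s+"
        // on every call.
        private static final Pattern WHITESPACE = Pattern.compile("\\s+");

        static String collapse(String text) {
            return WHITESPACE.matcher(text).replaceAll(" ");
        }
    }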

Hi Tom, yes, definitely. That fix (3abab54) was already in place for the profiling run (sorry, it hadn't yet been merged from our production branch into our master branch). Before the fix, a non-trivial amount of time was indeed spent compiling the regexes (2.24% and then some):
[flame graph: WAT/WET profile before optimizing String.replaceAll() (interactive SVG.zip attached)]
Thanks! I'll do the merges soon and also open PRs to push it upstream to iipc/webarchive-commons.

Yes, having source code which matched the profile would probably make the analysis easier. ;)

I've added the commit ID 3abab54 along with the profile - sorry, easy to overlook, and unfortunately it hadn't even been pushed to master. But it's now committed to master as well.

Thanks! Does the HEAD of master currently represent what was profiled?

The whitespace replaceAll is still taking over 5% of the time, and I question its usefulness, since it does only the most basic whitespace compression (and only for Java's rudimentary idea of whitespace).

Yes, the current master is almost identical to 3abab54, which was profiled (first image). There are differences, but they only affect dependencies and how methods therein are called.

Even the ASCII whitespace normalization is necessary, because whitespace is used for HTML source code formatting. The string values in WAT files would look ugly otherwise. One example:

*** 110205,110209 ****
              {
                "path": "A@/href",
!               "text": "Fútbol internacional",
                "url": "/deportes/futbol-internacional"
              },
--- 110141,110145 ----
              {
                "path": "A@/href",
!               "text": "Fútbol\n                                      internacional",
                "url": "/deportes/futbol-internacional"
              },
***************

The corresponding HTML snippet is:

...<li class="mob-menu-item"><a href="/deportes/futbol-internacional">Fútbol
                                      internacional </a>

So just removing slow code segments isn't an option here. But the method handleTextNode(...) can definitely be optimized:

  • in CC's fork, the text used for the WET is cleaned with Unicode whitespace taken into account (see the sketch below)
  • but for extracting the string values for the WAT, the code is still the original one
    In fact, that's one of the points to be resolved when pushing the WET extraction code upstream.
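A sketch of the distinction, assuming a regex-based normalization (Java's \s is ASCII-only unless UNICODE_CHARACTER_CLASS is set; the class here is illustrative):

    import java.util.regex.Pattern;

    public class WhitespaceNormalization {
        // ASCII-only by default: \s covers just [ \t\n\x0B\f\r].
        static final Pattern ASCII_WS = Pattern.compile("\\s+");

        // With UNICODE_CHARACTER_CLASS, \s matches every code point with
        // the Unicode White_Space property, e.g. U+00A0 (no-break space).
        static final Pattern UNICODE_WS =
                Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

        public static void main(String[] args) {
            String s = "F\u00FAtbol\u00A0\n internacional";
            System.out.println(ASCII_WS.matcher(s).replaceAll(" "));   // NBSP survives
            System.out.println(UNICODE_WS.matcher(s).replaceAll(" ")); // fully collapsed
        }
    }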

The HTML parser is already looking at every character, so that seems like it should be the right place to do this. Parts of the parser can apparently do inter-word space collapsing according to the HTML spec (Unicode whitespace doesn't appear relevant), but I need to research more to see if this is exposed someplace where we can use it.

http://htmlparser.sourceforge.net/javadoc/org/htmlparser/beans/StringBean.html#mCollapse
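For reference, a small exploratory sketch of the StringBean collapse option from that javadoc (whether the WAT/WET generator can hook into this code path is exactly the open question above):

    import org.htmlparser.beans.StringBean;

    public class StringBeanCollapse {
        public static void main(String[] args) {
            StringBean sb = new StringBean();
            sb.setLinks(false);                   // extract text only, no link URLs
            sb.setCollapse(true);                 // collapse whitespace runs
            sb.setReplaceNonBreakingSpaces(true); // treat &nbsp; as ordinary space
            sb.setURL("http://example.com/");     // fetches and parses the page
            System.out.println(sb.getStrings());
        }
    }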

Hello Sebastian + Tom,

If it's welcome, I can look at whether the JSON serialization of MetaData can be improved, which should speed up WAT generation. I suspect there is room for improvement as:

  • according to the pom, it's using Douglas Crockford's reference Java implementation for JSON, which was not designed for performance
  • moreover, it's using an old version, 20131018, but the current version is 20180813

I can think of three approaches to speed things up. I can prototype them and do some cursory regression/performance testing if you can recommend the best way to validate the change, e.g. is there an existing test harness / performance harness?

In increasing order of risk / effort / reward:

  1. bump org.json:json from 20131018 to 20180813. I'm skeptical it would speed things up much: comparing the old code for JSONObject to the current code, there are still some unfortunate practices, e.g. quote(), which gets called repeatedly (and often with the same set of keys), does allocations and synchronization.

  2. keep org.json:json, but use it only for building up the MetaData object via .put and .get operations on JSONObject, and use a different serializer for generating the JSON. If this is designed to run in Hadoop, I'd suggest using the Jackson serializer that is distributed with the Hadoop environment (see the sketch below).

  3. drop org.json:json since at this point it's just a glorified wrapper around a hashmap.

(3) feels very high risk unless there's a strong test suite, so I mention it for completeness only; I doubt it'd be worth pursuing.
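A hypothetical sketch of option (2), assuming the metadata stays in a JSONObject (note that toMap() only exists in newer org.json releases):

    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.json.JSONObject;

    public class JacksonSerialize {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        static String toJson(JSONObject metadata) throws Exception {
            // toMap() yields plain Java Maps/Lists, which Jackson
            // serializes directly without going through org.json.
            return MAPPER.writeValueAsString(metadata.toMap());
        }
    }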

Thanks for any pointers you can offer!

Hi @cldellow, I would opt for 3 because

  1. although there is no test suite at all which verifies the WAT output, it would be easy to diff a set of WAT files written with another JSON lib. Ideally, there should be no differences, except maybe in the ordering of items in maps.
  2. adding a small but strong unit test would already be a step forward:
     • the WAT format description isn't very precise about the gory details
     • "a glorified wrapper around a hashmap" brings the first serious test case to mind (see the sketch below): HTTP allows repeated headers (same name), but the hash map in the JSON lib may not. Are repeated HTTP headers preserved?
  3. the license of the org.json package is considered to be non-free
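A quick demo of that concern with the current org.json dependency (demo code, not generator code):

    import org.json.JSONObject;

    public class RepeatedHeaderDemo {
        public static void main(String[] args) {
            // JSONObject is map-backed, so put() with a repeated key
            // keeps only the last value; accumulate() would merge the
            // values into a JSONArray instead.
            JSONObject headers = new JSONObject();
            headers.put("Set-Cookie", "a=1");
            headers.put("Set-Cookie", "b=2");  // silently overwrites "a=1"
            System.out.println(headers);       // {"Set-Cookie":"b=2"}
        }
    }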

I agree with dropping the Crockford library for both performance and license reasons.

Here's a relatively recent survey of serialization/deserialization performance for a number of different packages, including Jackson. https://interviewbubble.com/performance-comparison-of-json-libraries-jackson-vs-gson-vs-fastjson-vs-json-simple-vs-jsonp/

Great, thanks for the feedback. I'll pursue the more aggressive approach, including adding some tests and using jsoniter (which is MIT licensed, so should be fine to include in an Apache 2 project).
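For illustration, a minimal jsoniter sketch, assuming the metadata ends up as plain Java maps once org.json is dropped (the values mirror the WAT example earlier in the thread):

    import com.jsoniter.output.JsonStream;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class JsoniterSketch {
        public static void main(String[] args) {
            Map<String, Object> link = new LinkedHashMap<>();
            link.put("path", "A@/href");
            link.put("text", "F\u00FAtbol internacional");
            link.put("url", "/deportes/futbol-internacional");
            System.out.println(JsonStream.serialize(link));
        }
    }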

Thanks @sebastian-nagel, @tfmorris. I've opened #16 with a first pass. To avoid hijacking this thread, more discussion can happen on that PR.

Opened upstream PR: iipc#84. Thanks, everybody!