jgm/djot.js

Get faster

Opened this issue · 23 comments

jgm commented

Benchmarks are currently running about 3X slower than for commonmark.js. It would be good to understand why and narrow the gap.

jgm commented

Funny thing is that when I benchmark the commonmark.js CLI against the djot one, the djot one is always slightly faster. So I am not sure how to interpret these results. It could be that there's a bit more startup overhead in each conversion for djot (perhaps because of how parser objects are created) and that this predominates in repeated tests parsing very small texts. When longer texts are used, the difference in startup overhead gets swamped and djot comes out ahead.

It may be that there's some low-hanging fruit to be plucked, but I'd probably need guidance from someone who is more experienced in optimizing JS.

As a data point, for me djot takes about 1.75x more time than commonmark when rendering a 3.5 MB file.

Here's a flamegraph for djot, haven't looked at it yet: https://share.firefox.dev/3IuG1qg

jgm commented

Is this using it with node, or in the browser?

With node

This part looks potentially optimizable:

    for (let i = this.firstpos; i <= this.lastpos; i++) {
      if (this.matches[i]) {

If I understand this correctly, here we go through every "char" position and do a hashmap lookup. I think the map should be pretty sparse, so it'd be more efficient to convert matches to an array and sort it by pos rather than enumerating every pos.
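
Roughly what I have in mind, as a sketch (the Match shape and the function name are just illustrative, not the actual djot.js code):

    interface Match {
      startpos: number;
      endpos: number;
      annot: string;
    }

    // Walk only the matches we actually stored, in position order,
    // instead of probing every position between firstpos and lastpos.
    function sortedMatches(matches: Map<number, Match>): Match[] {
      return Array.from(matches.values()).sort((a, b) => a.startpos - b.startpos);
    }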

jgm commented

this.matches actually is an array -- though it's a sparse array and perhaps that's implemented just as a hash map.
Anyway, the reason it needs to be indexed by pos is that we sometimes override an earlier match, e.g. * might first be categorized as str but then later +emph.

If we just used an array, we could perhaps just add the +emph with the same position later in the array, then sort the array, then remove the first of every pair of adjacent elements with the same startpos, or something like that.
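
Something along these lines, as a sketch with made-up names:

    type Match = { startpos: number; endpos: number; annot: string };

    // Relies on Array.prototype.sort being stable (guaranteed since ES2019),
    // so a later-added match stays after an earlier one at the same startpos.
    function resolveMatches(matches: Match[]): Match[] {
      const sorted = matches.slice().sort((a, b) => a.startpos - b.startpos);
      const result: Match[] = [];
      for (const m of sorted) {
        const last = result[result.length - 1];
        if (last && last.startpos === m.startpos) {
          // the later match (e.g. +emph) overrides the provisional str
          result[result.length - 1] = m;
        } else {
          result.push(m);
        }
      }
      return result;
    }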

Yeah, there's some water to squeeze out of this stone: if I replace the array with a Map, I get a 10% win. Though I commented out the code that handles hard line breaks, as that uses the last element of the array, and we don't have that with a Map.

I am wondering: can we refer to an earlier match not by codepoint offset, but by its position in the matches array? That is, we accumulate events in an array; events are generally sorted by their startpos, but, if we change our mind about some past events, we go to their position in the array and replace them with tombstones.

I use a similar device in rust-analyzer:

https://github.com/rust-lang/rust-analyzer/blob/80cabf726068187d8686b5ccf37aac484da84904/crates/parser/src/event.rs#L20-L81

Parsing events are accumulated into a flat array, but events can be tombstones, or they can refer to future events.

(the backstory here is that, while brute-force replacement of the array with a Map speeds things up, getMatches is still the hottest single identifiable place in the profile, as we replaced a lot of negative hash map lookups with array allocation and sorting; so I am wondering if we can just be smarter with dumb dense arrays here)
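
Concretely, the kind of thing I have in mind, as a sketch (these are not djot.js's actual types):

    type Event =
      | { kind: "tombstone" }
      | { kind: "match"; startpos: number; endpos: number; annot: string };

    class EventSink {
      private events: Event[] = [];

      // Returns the index, so later code can revise this event in place.
      push(startpos: number, endpos: number, annot: string): number {
        this.events.push({ kind: "match", startpos, endpos, annot });
        return this.events.length - 1;
      }

      // "Changing our mind" about a past event: no deletion, no re-sorting.
      retract(index: number): void {
        this.events[index] = { kind: "tombstone" };
      }

      finish(): Event[] {
        return this.events.filter(e => e.kind !== "tombstone");
      }
    }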

jgm commented

Yes, this sounds like a plausible idea.

The main change would be adding to Opener a field for the index into the matches array (since we wouldn't be able to use the startpos for this). This would have to be added in addOpener. It would also be a good idea to incorporate adding the "default match" to addOpener; this would ensure we get the right index. Currently we do addOpener and then addMatch to add the default match.

Other cases to handle:

  • Currently we delete earlier "str" matches for $ or $$ before verbatim (since this turns into math) (lines 217-222). This should be doable with the proposed approach; we'd have to walk back through the array a bit and change these to tombstones. But this is going to be infrequent and wouldn't require walking back more than a few elements.
  • At line 408 we use strMatches to convert all the matches between start and endpos for a reference to str. This should be doable with the proposed approach; we can just walk back through the array until we get to the start position.
  • At line 490, similarly, but with a destination.

And then we need to figure out what the tombstone is and how to strip it out. It could just be null, I suppose, and we could filter the array on !== null before sorting? But maybe v8 would prefer if we used the same type throughout, in which case we could have annot = "tombstone".
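
In sketch form, keeping every element the same shape (names are illustrative):

    const TOMBSTONE = "tombstone";

    // A retracted match keeps the same object shape; it just gets the sentinel
    // annotation and is filtered out before the final pass.
    function stripTombstones<T extends { annot: string }>(matches: T[]): T[] {
      return matches.filter(m => m.annot !== TOMBSTONE);
    }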

[I can try this.]

jgm commented

I've made a stab at this in #24. I think the code is more elegant (and shorter) now. But two tests fail and I haven't yet gotten to the bottom of it.

Looked at the perf profile again this morning. Sadly, nothing more stands out: at this (function) level of granularity, the profile looks mostly flat; it's basically "stuff takes time". I think we'll need some sort of line profiler to get further along, but quick googling didn't show how to get that with JS.

Nonetheless:

  • I see a bunch of time taken by Builtins_KeyedLoadIC_Megamorphic. I have no idea what calls this function, but the "megamorphic" in the name seems to imply that we hit a deoptimized path somewhere. The specific "somewhere" is missing, so this is unlikely to be actionable.
  • I see v8::internal::StringTable::Data::TryStringToIndexOrLookupExisting. I think that one comes from doing lookups in the handlers array in parse.js. It seems that what is happening here is a genuine hash-table lookup -- we hash the tag name, etc. In particular (hypothesis), because we do things like "-" + annotation, our annotations aren't really interned-at-compile-time symbols, but actual runtime strings.
  • The array sorting from getMatches is visible in the profile. It seems that in principle we can avoid the need to sort.
  • accumulatedText.join("") is visible in the profile; it would probably be faster if we += strings directly (see the sketch below)
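
For that last point, the change would look roughly like this (a sketch; whether it actually helps would need measuring):

    // current approach: accumulate chunks and join at the end
    const chunks: string[] = [];
    chunks.push("some ");
    chunks.push("text");
    const joined = chunks.join("");

    // alternative: append to a string directly; v8 builds a rope (cons string)
    // for +=, so repeated appends are usually cheap
    let accumulated = "";
    accumulated += "some ";
    accumulated += "text";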

jgm commented

I adjusted the commonmark.js profiling so it matches what we're doing (parsing only, constructing the object inside the profiling loop), only with cm versions of our dj files.
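
Schematically, the commonmark.js side of the loop now looks something like this (a sketch; I'm assuming Benchmark.js here, which matches the ops/sec output format):

    import * as fs from "fs";
    import Benchmark from "benchmark";
    import * as commonmark from "commonmark";

    const file = "block-bq-flat.md"; // one of the .md files converted from the .dj benchmarks
    const contents = fs.readFileSync(file, "utf8");

    new Benchmark.Suite()
      .add("parse " + file, () => {
        // construct the Parser inside the timed function, matching the djot setup
        const reader = new commonmark.Parser();
        reader.parse(contents);
      })
      .on("cycle", (event: any) => console.log(String(event.target)))
      .run();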

The differences are rather astounding, though I must say I don't understand why the observed differences on real-world conversions are not very large:

block-bq-flat.md 263,041 ops/sec ±0.21% (100 runs sampled)
block-bq-nested.md 146,306 ops/sec ±0.18% (100 runs sampled)
block-code.md 555,951 ops/sec ±0.13% (100 runs sampled)
block-fences.md 600,999 ops/sec ±0.12% (99 runs sampled)
block-heading.md 283,913 ops/sec ±0.18% (100 runs sampled)
block-hr.md 437,882 ops/sec ±0.14% (100 runs sampled)
block-html.md 154,155 ops/sec ±0.21% (99 runs sampled)
block-lheading.md 386,463 ops/sec ±0.16% (100 runs sampled)
block-list-flat.md 34,534 ops/sec ±0.16% (101 runs sampled)
block-list-nested.md 34,284 ops/sec ±0.19% (101 runs sampled)
block-ref-flat.md 85,683 ops/sec ±0.22% (94 runs sampled)
block-ref-nested.md 73,614 ops/sec ±0.18% (95 runs sampled)
inline-autolink.md 92,900 ops/sec ±0.16% (101 runs sampled)
inline-backticks.md 451,831 ops/sec ±0.16% (100 runs sampled)
inline-em-flat.md 120,869 ops/sec ±0.21% (100 runs sampled)
inline-em-nested.md 168,278 ops/sec ±0.17% (101 runs sampled)
inline-em-worst.md 170,375 ops/sec ±0.17% (101 runs sampled)
inline-entity.md 193,644 ops/sec ±0.19% (98 runs sampled)
inline-escape.md 131,929 ops/sec ±0.15% (101 runs sampled)
inline-html.md 57,738 ops/sec ±0.18% (100 runs sampled)
inline-links-flat.md 52,047 ops/sec ±0.19% (97 runs sampled)
inline-links-nested.md 104,131 ops/sec ±0.16% (101 runs sampled)
inline-newlines.md 187,799 ops/sec ±0.25% (100 runs sampled)
lorem1.md 72,545 ops/sec ±0.16% (100 runs sampled)
README.md 7,623 ops/sec ±0.24% (99 runs sampled)

And here is djot.js:

parse block-bq-flat.dj x 97,828 ops/sec ±1.66% (88 runs sampled)
parse block-bq-nested.dj x 48,389 ops/sec ±0.39% (101 runs sampled)
parse block-code.dj x 214,310 ops/sec ±0.43% (100 runs sampled)
parse block-fences.dj x 205,352 ops/sec ±0.46% (96 runs sampled)
parse block-heading.dj x 70,228 ops/sec ±0.38% (101 runs sampled)
parse block-hr.dj x 77,164 ops/sec ±0.39% (95 runs sampled)
parse block-list-flat.dj x 10,230 ops/sec ±0.69% (99 runs sampled)
parse block-list-nested.dj x 13,184 ops/sec ±0.88% (99 runs sampled)
parse block-ref-flat.dj x 20,696 ops/sec ±0.33% (99 runs sampled)
parse block-ref-nested.dj x 15,024 ops/sec ±0.26% (100 runs sampled)
parse inline-autolink.dj x 36,221 ops/sec ±0.30% (98 runs sampled)
parse inline-backticks.dj x 115,746 ops/sec ±0.37% (102 runs sampled)
parse inline-em-flat.dj x 35,967 ops/sec ±0.26% (99 runs sampled)
parse inline-em-nested.dj x 37,750 ops/sec ±0.25% (98 runs sampled)
parse inline-em-worst.dj x 43,255 ops/sec ±0.32% (98 runs sampled)
parse inline-escape.dj x 34,745 ops/sec ±0.50% (99 runs sampled)
parse inline-links-flat.dj x 18,096 ops/sec ±0.35% (100 runs sampled)
parse inline-links-nested.dj x 15,281 ops/sec ±0.31% (100 runs sampled)
parse lorem1.dj x 18,738 ops/sec ±0.48% (99 runs sampled)
parse readme.dj x 2,770 ops/sec ±0.57% (100 runs sampled)
renderHTML readme x 17,337 ops/sec ±0.09% (102 runs sampled)

jgm commented

PS. My memory of optimizing the commonmark.js parser was that it was largely a matter of doing things that made the v8 optimizer happy -- so, it wasn't like optimizing a regular program, where you look for hot spots (the sort of thing we've been doing), but more about doing magic tricks that suddenly made a huge difference.

jgm commented

Tried replacing accumulatedText with a string, but no improvement, slight slowdown in some cases.

jgm commented

I think your first two findings are the most promising. Something is megamorphic, and if we figure out what's triggering that it could have a big effect. Note that in commonmark.js, we just use a Node object that, essentially, has all the fields anything will need (stringContent, destination, title, level, fenceChar, etc.). That ensures that all the objects have the same "shape," which apparently makes the optimizer happy. In djot.js, by contrast, we have AST nodes with different shapes, and maybe this makes any function that takes multiple kinds of AST nodes as parameters megamorphic. I'm not sure how to fix this, though; we want different shapes. One possible option to explore would be keeping the things that vary under a data field, so we have

{ tag: "heading",
  children: ...,
  attributes: ...,
  data: { level: 5 }
}
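
In TypeScript terms, the uniform shape might look something like this (a sketch, not the actual djot.js AST types):

    // Every node shares one top-level shape; node-specific fields live under
    // `data`, so code that handles many node kinds can stay monomorphic.
    interface UniformNode {
      tag: string;
      children: UniformNode[];
      attributes: Record<string, string>;
      data: Record<string, unknown>;
    }

    const heading: UniformNode = {
      tag: "heading",
      children: [],
      attributes: {},
      data: { level: 5 },
    };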

The second thing, avoiding things like "+" + annotation, seems worth exploring for sure.
EDIT: The only place I could find where we do "+" + annotation is in betweenMatches in inlines.ts. I tried removing that by giving this function parameters for both the open and close annotations, and just hard-coding "+emph" and "-emph" when calling the function. This did not yield any real difference in benchmarks.

jgm commented

OK, I managed to remove the need for the .sort() in inlines.
This gives us the following benchmarks (around a 20% improvement on inline-heavy text):

parse block-bq-flat.dj x 111,995 ops/sec ±0.64% (97 runs sampled)
parse block-bq-nested.dj x 54,569 ops/sec ±1.51% (97 runs sampled)
parse block-code.dj x 224,163 ops/sec ±0.68% (98 runs sampled)
parse block-fences.dj x 210,808 ops/sec ±0.46% (100 runs sampled)
parse block-heading.dj x 77,986 ops/sec ±0.43% (101 runs sampled)
parse block-hr.dj x 83,618 ops/sec ±0.38% (97 runs sampled)
parse block-list-flat.dj x 11,321 ops/sec ±0.54% (101 runs sampled)
parse block-list-nested.dj x 14,747 ops/sec ±0.52% (99 runs sampled)
parse block-ref-flat.dj x 24,116 ops/sec ±0.36% (98 runs sampled)
parse block-ref-nested.dj x 17,338 ops/sec ±0.27% (100 runs sampled)
parse inline-autolink.dj x 44,454 ops/sec ±0.33% (101 runs sampled)
parse inline-backticks.dj x 125,638 ops/sec ±0.37% (101 runs sampled)
parse inline-em-flat.dj x 41,196 ops/sec ±0.28% (96 runs sampled)
parse inline-em-nested.dj x 41,667 ops/sec ±0.24% (101 runs sampled)
parse inline-em-worst.dj x 48,444 ops/sec ±0.32% (100 runs sampled)
parse inline-escape.dj x 39,112 ops/sec ±0.35% (96 runs sampled)
parse inline-links-flat.dj x 22,038 ops/sec ±0.50% (99 runs sampled)
parse inline-links-nested.dj x 17,464 ops/sec ±0.31% (101 runs sampled)
parse lorem1.dj x 22,343 ops/sec ±0.36% (101 runs sampled)
parse readme.dj x 3,100 ops/sec ±0.57% (100 runs sampled)
renderHTML readme x 17,336 ops/sec ±0.08% (98 runs sampled)

jgm commented

In a real-world test with converted versions of the same document, I'm getting:

djot.js: 2874 bytes/ms
commonmark.js: 3780 bytes/ms

jgm commented

After a few more optimizations, I've improved it a bit more:

parse block-bq-flat.dj x 116,613 ops/sec ±0.58% (99 runs sampled)
parse block-bq-nested.dj x 57,481 ops/sec ±1.44% (96 runs sampled)
parse block-code.dj x 282,016 ops/sec ±0.48% (98 runs sampled)
parse block-fences.dj x 272,059 ops/sec ±0.49% (95 runs sampled)
parse block-heading.dj x 86,364 ops/sec ±0.65% (97 runs sampled)
parse block-hr.dj x 89,644 ops/sec ±0.38% (99 runs sampled)
parse block-list-flat.dj x 12,203 ops/sec ±0.54% (98 runs sampled)
parse block-list-nested.dj x 15,496 ops/sec ±0.50% (100 runs sampled)
parse block-ref-flat.dj x 24,117 ops/sec ±0.35% (100 runs sampled)
parse block-ref-nested.dj x 17,079 ops/sec ±0.26% (98 runs sampled)
parse inline-autolink.dj x 44,118 ops/sec ±0.32% (100 runs sampled)
parse inline-backticks.dj x 125,903 ops/sec ±0.35% (100 runs sampled)
parse inline-em-flat.dj x 41,459 ops/sec ±0.30% (100 runs sampled)
parse inline-em-nested.dj x 42,237 ops/sec ±0.30% (96 runs sampled)
parse inline-em-worst.dj x 47,392 ops/sec ±0.32% (97 runs sampled)
parse inline-escape.dj x 39,054 ops/sec ±0.32% (99 runs sampled)
parse inline-links-flat.dj x 22,308 ops/sec ±0.36% (101 runs sampled)
parse inline-links-nested.dj x 17,341 ops/sec ±0.30% (99 runs sampled)
parse lorem1.dj x 22,885 ops/sec ±0.36% (98 runs sampled)
parse readme.dj x 3,262 ops/sec ±0.56% (98 runs sampled)
renderHTML readme x 17,430 ops/sec ±0.05% (97 runs sampled)

jgm commented

After 819f8f3 we're a bit faster:

parse block-bq-flat.dj x 122,598 ops/sec ±0.61% (100 runs sampled)
parse block-bq-nested.dj x 59,940 ops/sec ±1.35% (99 runs sampled)
parse block-code.dj x 285,294 ops/sec ±0.52% (99 runs sampled)
parse block-fences.dj x 272,202 ops/sec ±0.49% (100 runs sampled)
parse block-heading.dj x 91,620 ops/sec ±0.45% (97 runs sampled)
parse block-hr.dj x 93,569 ops/sec ±0.37% (97 runs sampled)
parse block-list-flat.dj x 12,692 ops/sec ±0.51% (100 runs sampled)
parse block-list-nested.dj x 15,946 ops/sec ±0.51% (99 runs sampled)
parse block-ref-flat.dj x 26,055 ops/sec ±0.36% (98 runs sampled)
parse block-ref-nested.dj x 18,252 ops/sec ±0.29% (100 runs sampled)
parse inline-autolink.dj x 47,963 ops/sec ±0.32% (98 runs sampled)
parse inline-backticks.dj x 134,984 ops/sec ±0.40% (94 runs sampled)
parse inline-em-flat.dj x 43,427 ops/sec ±0.27% (96 runs sampled)
parse inline-em-nested.dj x 43,967 ops/sec ±0.38% (100 runs sampled)
parse inline-em-worst.dj x 51,468 ops/sec ±0.51% (96 runs sampled)
parse inline-escape.dj x 42,720 ops/sec ±0.33% (100 runs sampled)
parse inline-links-flat.dj x 24,501 ops/sec ±0.38% (99 runs sampled)
parse inline-links-nested.dj x 18,910 ops/sec ±0.31% (101 runs sampled)
parse lorem1.dj x 24,809 ops/sec ±0.34% (101 runs sampled)
parse readme.dj x 3,486 ops/sec ±0.58% (99 runs sampled)
renderHTML readme x 17,433 ops/sec ±0.04% (100 runs sampled)

jgm commented

After 1492366 :

parse block-bq-flat.dj x 132,922 ops/sec ±0.63% (97 runs sampled)
parse block-bq-nested.dj x 64,055 ops/sec ±1.37% (93 runs sampled)
parse block-code.dj x 305,707 ops/sec ±0.67% (98 runs sampled)
parse block-fences.dj x 289,597 ops/sec ±0.52% (97 runs sampled)
parse block-heading.dj x 96,519 ops/sec ±0.59% (93 runs sampled)
parse block-hr.dj x 100,535 ops/sec ±0.73% (99 runs sampled)
parse block-list-flat.dj x 12,279 ops/sec ±0.92% (99 runs sampled)
parse block-list-nested.dj x 15,425 ops/sec ±0.59% (99 runs sampled)
parse block-ref-flat.dj x 27,987 ops/sec ±0.54% (93 runs sampled)
parse block-ref-nested.dj x 19,508 ops/sec ±0.32% (96 runs sampled)
parse inline-autolink.dj x 49,090 ops/sec ±0.32% (100 runs sampled)
parse inline-backticks.dj x 139,996 ops/sec ±0.54% (100 runs sampled)
parse inline-em-flat.dj x 46,086 ops/sec ±0.30% (96 runs sampled)
parse inline-em-nested.dj x 47,134 ops/sec ±0.29% (96 runs sampled)
parse inline-em-worst.dj x 57,647 ops/sec ±0.35% (99 runs sampled)
parse inline-escape.dj x 47,039 ops/sec ±0.34% (100 runs sampled)
parse inline-links-flat.dj x 26,364 ops/sec ±0.39% (101 runs sampled)
parse inline-links-nested.dj x 21,174 ops/sec ±0.33% (98 runs sampled)
parse lorem1.dj x 26,566 ops/sec ±0.38% (100 runs sampled)
parse readme.dj x 3,529 ops/sec ±0.57% (99 runs sampled)
renderHTML readme x 17,457 ops/sec ±0.04% (101 runs sampled)

jgm commented

After ddb155a

parse block-bq-flat.dj x 131,276 ops/sec ±0.60% (96 runs sampled)
parse block-bq-nested.dj x 63,742 ops/sec ±1.39% (95 runs sampled)
parse block-code.dj x 301,477 ops/sec ±0.64% (100 runs sampled)
parse block-fences.dj x 286,620 ops/sec ±0.50% (97 runs sampled)
parse block-heading.dj x 95,016 ops/sec ±0.50% (99 runs sampled)
parse block-hr.dj x 103,091 ops/sec ±0.37% (102 runs sampled)
parse block-list-flat.dj x 12,714 ops/sec ±0.51% (101 runs sampled)
parse block-list-nested.dj x 15,965 ops/sec ±0.53% (100 runs sampled)
parse block-ref-flat.dj x 28,886 ops/sec ±0.38% (99 runs sampled)
parse block-ref-nested.dj x 19,939 ops/sec ±0.27% (102 runs sampled)
parse inline-autolink.dj x 50,233 ops/sec ±0.35% (100 runs sampled)
parse inline-backticks.dj x 145,375 ops/sec ±0.38% (98 runs sampled)
parse inline-em-flat.dj x 46,144 ops/sec ±0.27% (99 runs sampled)
parse inline-em-nested.dj x 46,735 ops/sec ±0.27% (98 runs sampled)
parse inline-em-worst.dj x 57,474 ops/sec ±0.33% (99 runs sampled)
parse inline-escape.dj x 47,054 ops/sec ±0.35% (99 runs sampled)
parse inline-links-flat.dj x 26,326 ops/sec ±0.37% (101 runs sampled)
parse inline-links-nested.dj x 21,142 ops/sec ±0.32% (101 runs sampled)
parse lorem1.dj x 26,505 ops/sec ±0.37% (98 runs sampled)
parse readme.dj x 3,523 ops/sec ±0.57% (99 runs sampled)
renderHTML readme x 29,028 ops/sec ±0.04% (98 runs sampled)

jgm commented

After 57c81ea

parse block-bq-flat.dj x 132,939 ops/sec ±0.65% (94 runs sampled)
parse block-bq-nested.dj x 64,598 ops/sec ±1.51% (92 runs sampled)
parse block-code.dj x 324,897 ops/sec ±0.68% (100 runs sampled)
parse block-fences.dj x 309,821 ops/sec ±0.53% (97 runs sampled)
parse block-heading.dj x 98,134 ops/sec ±0.49% (95 runs sampled)
parse block-hr.dj x 103,908 ops/sec ±0.39% (100 runs sampled)
parse block-list-flat.dj x 12,846 ops/sec ±0.54% (98 runs sampled)
parse block-list-nested.dj x 16,011 ops/sec ±0.77% (98 runs sampled)
parse block-ref-flat.dj x 29,263 ops/sec ±0.54% (97 runs sampled)
parse block-ref-nested.dj x 19,837 ops/sec ±0.57% (97 runs sampled)
parse inline-autolink.dj x 51,005 ops/sec ±0.33% (99 runs sampled)
parse inline-backticks.dj x 151,295 ops/sec ±0.38% (97 runs sampled)
parse inline-em-flat.dj x 46,415 ops/sec ±0.27% (102 runs sampled)
parse inline-em-nested.dj x 46,887 ops/sec ±0.26% (102 runs sampled)
parse inline-em-worst.dj x 57,678 ops/sec ±0.33% (101 runs sampled)
parse inline-escape.dj x 47,065 ops/sec ±0.35% (102 runs sampled)
parse inline-links-flat.dj x 26,569 ops/sec ±0.41% (101 runs sampled)
parse inline-links-nested.dj x 21,242 ops/sec ±0.33% (100 runs sampled)
parse lorem1.dj x 26,592 ops/sec ±0.36% (101 runs sampled)
parse readme.dj x 3,577 ops/sec ±0.56% (100 runs sampled)
renderHTML readme x 28,820 ops/sec ±0.05% (98 runs sampled)

jgm commented

After upgrading to node v16, times are slower. I note this here so we don't think changes in the code have caused the slower performance:

parse block-bq-flat.dj x 118,939 ops/sec ±0.77% (99 runs sampled)
parse block-bq-nested.dj x 56,804 ops/sec ±0.68% (100 runs sampled)
parse block-code.dj x 287,796 ops/sec ±0.58% (91 runs sampled)
parse block-fences.dj x 279,181 ops/sec ±0.58% (98 runs sampled)
parse block-heading.dj x 91,807 ops/sec ±0.92% (94 runs sampled)
parse block-hr.dj x 90,520 ops/sec ±0.11% (96 runs sampled)
parse block-list-flat.dj x 11,376 ops/sec ±2.63% (94 runs sampled)
parse block-list-nested.dj x 14,992 ops/sec ±0.84% (99 runs sampled)
parse block-ref-flat.dj x 26,437 ops/sec ±0.58% (97 runs sampled)
parse block-ref-nested.dj x 18,655 ops/sec ±0.51% (100 runs sampled)
parse inline-autolink.dj x 48,486 ops/sec ±0.59% (98 runs sampled)
parse inline-backticks.dj x 139,860 ops/sec ±0.54% (99 runs sampled)
parse inline-em-flat.dj x 51,949 ops/sec ±0.45% (97 runs sampled)
parse inline-em-nested.dj x 51,112 ops/sec ±0.51% (98 runs sampled)
parse inline-em-worst.dj x 51,228 ops/sec ±0.45% (96 runs sampled)
parse inline-escape.dj x 42,468 ops/sec ±0.48% (99 runs sampled)
parse inline-links-flat.dj x 24,437 ops/sec ±1.18% (99 runs sampled)
parse inline-links-nested.dj x 18,685 ops/sec ±0.49% (98 runs sampled)
parse lorem1.dj x 23,750 ops/sec ±0.48% (97 runs sampled)
parse readme.dj x 3,275 ops/sec ±0.94% (96 runs sampled)
renderHTML readme x 29,720 ops/sec ±0.07% (98 runs sampled)