serokell/xrefcheck

Parallelize file parsing

YuriRomanowski opened this issue · 2 comments

Clarification and motivation

This topic is a part of #221.
After we read file contents, we should parse the files, which is (in theory) a pure action and can be parallelized.
But since we use a C library under the hood, the parallelization may be tricky. Here we can try some approaches and discuss the results.
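A minimal sketch of the idea using spark-based parallelism from base's GHC.Conc (here `parseFile` is a hypothetical pure stand-in for xrefcheck's real FFI-backed markdown parser, which is exactly the part that may not parallelize so cleanly):

```haskell
import GHC.Conc (par, pseq)

-- Hypothetical stand-in for the real parser. A genuinely pure Haskell
-- function like this sparks safely; a parser that calls into C may
-- instead serialize on (or be unsafe around) the foreign calls.
parseFile :: String -> Int
parseFile = length . words

-- Spark one parse per file so the RTS may evaluate them on different
-- capabilities; `pseq` forces the rest of the list before consing.
parseAll :: [String] -> [Int]
parseAll [] = []
parseAll (f:fs) =
  let r  = parseFile f
      rs = parseAll fs
  in r `par` (rs `pseq` (r : rs))

main :: IO ()
main = print (parseAll ["one two", "three four five"])
```

Run with `+RTS -N` (and a `-threaded` build) for the sparks to actually be picked up by multiple cores.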

Acceptance criteria

  • Several approaches to parallelization are tried
  • We decide how to handle foreign calls during parsing for the sake of parallelization
  • A speedup is obtained and proved with measurements

I uploaded some commits where different variations of xrefcheck can be load-tested (in branch YuriRomanowski/#247-parallelize-file-parsing-scaffolding):

  • Original version (from master) with lazy readFile: 2a959d0
  • Replace lazy readFile with strict one: b3368c3
  • Force reading files and then process them in parallel using Eval monad: a1d5f56
  • Force reading files and then process them using mapConcurrently: 8f65374
    The latter two produce similar results.
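The `mapConcurrently` variant can be approximated with base-only primitives (`mapConcurrently'` and `parseFile` below are illustrative names, not xrefcheck code). Unlike sparks, forked green threads can keep making progress while one of them blocks in a safe foreign call, which is one reason this approach may behave differently around the C parser:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)

-- Base-only approximation of async's mapConcurrently: fork one thread
-- per file, force the result inside that thread, collect via MVars.
mapConcurrently' :: (a -> b) -> [a] -> IO [b]
mapConcurrently' f xs = do
  vars <- mapM (\x -> do
            v <- newEmptyMVar
            _ <- forkIO (evaluate (f x) >>= putMVar v)
            pure v) xs
  mapM takeMVar vars

-- Hypothetical stand-in for the parser, applied to pre-read contents.
parseFile :: String -> Int
parseFile = length . words

main :: IO ()
main = mapConcurrently' parseFile ["a b", "c d e"] >>= print
```

The real `async` library's `mapConcurrently` additionally handles exceptions and cancellation, which this sketch omits.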

Thanks for this investigation!

I tried, and from what I can see:

  • Repo scanning time does not differ dramatically across these scenarios (0.9s / 0.7s / 0.5s / 0.5s)
  • My impression was that in the given load test there was simply no room for parallelization (this is what we saw in the picture below). The Sparks tab shows that a few sparks were bound to different cores, but most of them went to one core, probably simply because parsing was fast enough for a single core to handle it all.
  • I tried creating 4 dummy markdown files, 50Kb each, and the sparks-based solution showed 4 cores being used.

(the selected area corresponds to repo scanning time)
[Screenshot: ThreadScope trace, 2023-01-31]

Although I'm not exactly sure why the "Activity" graph at the top shows so little CPU usage.