serokell/xrefcheck

Parallelize file parsing

YuriRomanowski opened this issue · 2 comments

Clarification and motivation

This topic is a part of #221.
After we read file contents, we should parse the files, which is (in theory) a pure action and can be parallelized.
But since we use a C library under the hood, the parallelization may be tricky. Here we can try some approaches and discuss the results.
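A minimal sketch of the idea using spark-based parallelism from base's GHC.Conc (here `parseFile` is a hypothetical pure stand-in for xrefcheck's real FFI-backed markdown parser, which is exactly the part that may not parallelize so cleanly):

```haskell
import GHC.Conc (par, pseq)

-- Hypothetical stand-in for the real parser. A genuinely pure Haskell
-- function like this sparks safely; a parser that calls into C may
-- instead serialize on (or be unsafe around) the foreign calls.
parseFile :: String -> Int
parseFile = length . words

-- Spark one parse per file so the RTS may evaluate them on different
-- capabilities; `pseq` forces the rest of the list before consing.
parseAll :: [String] -> [Int]
parseAll [] = []
parseAll (f:fs) =
  let r  = parseFile f
      rs = parseAll fs
  in r `par` (rs `pseq` (r : rs))

main :: IO ()
main = print (parseAll ["one two", "three four five"])
```

Run with `+RTS -N` (and a `-threaded` build) for the sparks to actually be picked up by multiple cores.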

Acceptance criteria

  • Several approaches to parallelization are tried
  • We decide how to handle foreign calls during parsing for the sake of parallelization
  • A speedup is obtained and proved with measurements

I uploaded some commits where different variations of xrefcheck can be load-tested (in branch YuriRomanowski/#247-parallelize-file-parsing-scaffolding):

  • Original version (from master) with lazy readFile: 2a959d0
  • Replace lazy readFile with strict one: b3368c3
  • Force reading files and then process them in parallel using Eval monad: a1d5f56
  • Force reading files and then process them using mapConcurrently: 8f65374
    The latter two produce similar results.
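The `mapConcurrently` variant can be approximated with base-only primitives (`mapConcurrently'` and `parseFile` below are illustrative names, not xrefcheck code). Unlike sparks, forked green threads can keep making progress while one of them blocks in a safe foreign call, which is one reason this approach may behave differently around the C parser:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)

-- Base-only approximation of async's mapConcurrently: fork one thread
-- per file, force the result inside that thread, collect via MVars.
mapConcurrently' :: (a -> b) -> [a] -> IO [b]
mapConcurrently' f xs = do
  vars <- mapM (\x -> do
            v <- newEmptyMVar
            _ <- forkIO (evaluate (f x) >>= putMVar v)
            pure v) xs
  mapM takeMVar vars

-- Hypothetical stand-in for the parser, applied to pre-read contents.
parseFile :: String -> Int
parseFile = length . words

main :: IO ()
main = mapConcurrently' parseFile ["a b", "c d e"] >>= print
```

The real `async` library's `mapConcurrently` additionally handles exceptions and cancellation, which this sketch omits.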

Thanks for this investigation!

I tried, and from what I can see:

  • Repo scanning time does not differ dramatically across these scenarios (0.9s / 0.7s / 0.5s / 0.5s)
  • My impression was that in the given load test there was simply no room for parallelization (this is what we saw in the picture below). The Sparks tab shows that a few sparks were bound to different cores, but most of them went to one core, probably simply because parsing was fast enough for a single core to handle it all.
  • I tried creating 4 dummy markdown files, 50Kb each, and the sparks-based solution showed 4 cores being used.

(the selected area corresponds to repo scanning time)
[Screenshot: ThreadScope trace, 2023-01-31]

Although I'm not exactly sure why the "Activity" graph at the top shows so little CPU usage.