Slow parse_collection due to I/O synchronization
Opened this issue · 3 comments
Describe the bug
Ingesting plaintext records via stdin seems very slow (12
MiB/sec) even though 60
worker threads are used. I suspect this has to do with https://en.cppreference.com/w/cpp/io/ios_base/sync_with_stdio
Before I disabled synchronization most threads were idle and I could only parse with 12
MiB/sec. After I disabled sync I was able to ingest with full speed (speed of the decompressor) and all threads were utilized.
Not sure if this is a safe thing to do though..
To Reproduce
Steps to reproduce the behavior:
- Ingest data in plain text format:
find /storage/col-zstd/ -name '*.zst' -exec zstdcat {} \; | pv | ./parse_collection -o /storage/col_pisa_idx/forward -j 60 -f plaintext --stemmer krovetz -b 100000
- Observe speeds reporter by pv of
12
MiB/sec.perf top
shows that the process is I/O bound as most time is spent inlibc::getc
andlibc::ungetc
. Only3
cores are utilized. - Add
std::ios::sync_with_stdio(false);
toparse_collection
and observe speeds go to350
MiB/sec (zstd decompression speed) and all 60 cores are utilized.
Expected behavior
Unclear if this optimization is safe. It appears to be as only one thread ever reads from stdin before work is passed to the different worker threads?
Environment info
Operating System: Ubuntu 20.04
Compiler: g++ (Ubuntu 9.3.0-10ubuntu2) 9.3.0
Thanks @mpetri for reporting. Yes, this should be totally safe. The input is read from stdin
only by the main thread and then batches are delegated to the workers, and these write do different files, after which one thread merges them. So there should be no races.
If you notice any problems, let us know.
After creating an index with this enabled everything looked okay. So you are right it looks like it is safe. The main question would be where to enable this in the program? At the start or only at the beginning of the while loop in the forward_index_builder
I think we should do it in the parse_collection
tool itself. Right now, it would probably not make any difference, but I'd rather not change global state in one of the algorithms, it might lead to some surprising bugs down the road.