pisa-engine/pisa

Slow parse_collection due to I/O synchronization

Opened this issue · 3 comments

Describe the bug
Ingesting plaintext records via stdin seems very slow (12MiB/sec) even though 60 worker threads are used. I suspect this has to do with https://en.cppreference.com/w/cpp/io/ios_base/sync_with_stdio

Before I disabled synchronization most threads were idle and I could only parse with 12MiB/sec. After I disabled sync I was able to ingest with full speed (speed of the decompressor) and all threads were utilized.

Not sure if this is a safe thing to do though..

To Reproduce
Steps to reproduce the behavior:

  1. Ingest data in plain text format: find /storage/col-zstd/ -name '*.zst' -exec zstdcat {} \; | pv | ./parse_collection -o /storage/col_pisa_idx/forward -j 60 -f plaintext --stemmer krovetz -b 100000
  2. Observe speeds reporter by pv of 12 MiB/sec. perf top shows that the process is I/O bound as most time is spent in libc::getc and libc::ungetc. Only 3 cores are utilized.
  3. Add std::ios::sync_with_stdio(false); to parse_collection and observe speeds go to 350 MiB/sec (zstd decompression speed) and all 60 cores are utilized.

Expected behavior

Unclear if this optimization is safe. It appears to be as only one thread ever reads from stdin before work is passed to the different worker threads?

Environment info
Operating System: Ubuntu 20.04
Compiler: g++ (Ubuntu 9.3.0-10ubuntu2) 9.3.0

Thanks @mpetri for reporting. Yes, this should be totally safe. The input is read from stdin only by the main thread and then batches are delegated to the workers, and these write do different files, after which one thread merges them. So there should be no races.

If you notice any problems, let us know.

After creating an index with this enabled everything looked okay. So you are right it looks like it is safe. The main question would be where to enable this in the program? At the start or only at the beginning of the while loop in the forward_index_builder

I think we should do it in the parse_collection tool itself. Right now, it would probably not make any difference, but I'd rather not change global state in one of the algorithms, it might lead to some surprising bugs down the road.