Unstructured-IO/unstructured-api

Is enabling "parallel mode" only recommended for `hi_res` strategy?

omikader opened this issue · 5 comments

Describe the bug

I'm playing around with parallel mode and the `fast` strategy, and I was surprised to see that partitioning my PDF took longer. Is this expected? Is parallel mode only recommended when using the `hi_res` strategy?

  • Without parallel mode (30.9s)
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 26.1M  100 3199k  100 23.0M   103k   763k  0:00:30  0:00:30 --:--:--  767k
curl -O -X 'POST' 'http://localhost:8000/general/v0/general' -H  -H  -F  -F    0.01s user 0.03s system 0% cpu 30.936 total
  • With parallel mode & default parameters (62.7s)
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 26.2M  100 3267k  100 23.0M  53379   376k  0:01:02  0:01:02 --:--:--  878k
curl -O -X 'POST' 'http://localhost:8000/general/v0/general' -H  -H  -F  -F    0.01s user 0.03s system 0% cpu 1:02.71 total

To Reproduce

  • Filetype: PDF (24.2 MB & 870 pages)
  • Any additional API parameters: strategy: 'fast', chunking_strategy: 'by_title'

Environment:

I'm running unstructured-api as a Docker container on my local machine

omar@Omars-MacBook-Pro % docker run -p 8000:8000 -d --rm --name unstructured-api \
-e UNSTRUCTURED_PARALLEL_MODE_ENABLED='true' \
-e UNSTRUCTURED_PARALLEL_MODE_URL='http://127.0.0.1:8000/general/v0/general' \
downloads.unstructured.io/unstructured-io/unstructured-api:latest \
--port 8000 --host 0.0.0.0
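
For reference, the README also documents tuning knobs for parallel mode beyond the two variables above. The names and values below are from my recollection of the unstructured-api README and should be verified against the current README before relying on them; this is a sketch of the fuller configuration, not an authoritative reference:

```shell
# Hypothetical fuller parallel-mode configuration (verify variable names
# against the current unstructured-api README):
docker run -p 8000:8000 -d --rm --name unstructured-api \
  -e UNSTRUCTURED_PARALLEL_MODE_ENABLED='true' \
  -e UNSTRUCTURED_PARALLEL_MODE_URL='http://127.0.0.1:8000/general/v0/general' \
  -e UNSTRUCTURED_PARALLEL_MODE_THREADS='3' \
  -e UNSTRUCTURED_PARALLEL_MODE_SPLIT_SIZE='1' \
  downloads.unstructured.io/unstructured-io/unstructured-api:latest \
  --port 8000 --host 0.0.0.0
```

Note also that with the URL pointing back at 127.0.0.1 inside the same container, every chunk is re-posted to that single instance, so this setup measures the splitting overhead without any horizontal scaling behind it.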

All requests are made using cURL

omar@Omars-MacBook-Pro % time curl -O -X 'POST' \
  'http://localhost:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@Consolidated Set of Standards.pdf' \
  -F 'chunking_strategy=by_title' \
  -F 'strategy=fast'

Seems like the answer is yes! I just tested the same scenario with `hi_res` and saw that parallel mode took 19 minutes, compared to 41 minutes without it. It appears that the overhead of file splitting/consolidation hurts performance for the `fast` strategy.

Leaving this issue open in case it helps someone else and leads to more explicit guidance in the README (e.g. "This mode is only recommended when using the `hi_res` strategy").

Hi @awalker4! Would love to get your take on this, if possible. Thank you!

Hi @omikader, sorry for the delay. Correct - parallel mode gives a huge speedup for `hi_res`, but otherwise just adds overhead. The library code for `hi_res` is entirely serial, and we needed a way to split up the work without redesigning the whole library. Splitting the file and sending out another batch of API requests was a simple way to let the load balancer do the scaling for us. Since `hi_res` PDF processing is so CPU-heavy (it's all those Tesseract calls!), this unlocks a huge speedup. Any other filetype/strategy will be done long before the PDF even gets split up.
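
The split-and-fan-out approach described above can be sketched as follows. This is a minimal reconstruction of the idea, not the actual unstructured-api code; `split_page_ranges`, `partition_in_parallel`, and `partition_chunk` are hypothetical names, and `partition_chunk` stands in for the HTTP POST of one PDF chunk:

```python
# Sketch of parallel mode's split-and-fan-out idea (my reconstruction,
# not the real unstructured-api implementation).
from concurrent.futures import ThreadPoolExecutor


def split_page_ranges(num_pages: int, pages_per_chunk: int) -> list[tuple[int, int]]:
    """Return inclusive, 1-based (start, end) page ranges, one per chunk."""
    return [
        (start, min(start + pages_per_chunk - 1, num_pages))
        for start in range(1, num_pages + 1, pages_per_chunk)
    ]


def partition_in_parallel(num_pages, pages_per_chunk, partition_chunk):
    """Fan page ranges out to `partition_chunk` concurrently.

    Each call to `partition_chunk` models re-POSTing one PDF chunk to the
    API; the per-chunk element lists are concatenated back in page order.
    """
    ranges = split_page_ranges(num_pages, pages_per_chunk)
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(partition_chunk, ranges))
    return [element for chunk in results for element in chunk]
```

For an 870-page PDF split 100 pages at a time, `split_page_ranges(870, 100)` yields nine chunks, `(1, 100)` through `(801, 870)`, each of which becomes an independent request that a load balancer can route to a different replica. The splitting, re-uploading, and re-assembly are pure overhead unless each chunk's processing time dominates, which is why this only pays off for `hi_res`.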

Thanks for calling this out - I'll add a note to the README. Or, if you have a moment, a PR would be a huge help :)

Also note that we're pushing to do PDF splitting on the client these days - we don't actually have parallel mode enabled on our own servers anymore. We've essentially reimplemented this logic in the Python client.

@awalker4 done! See #395.

And thanks for the tip about client-side splitting! I'm currently using the JS client, so I'm looking forward to seeing that supported over there soon 🙂