jina-ai/examples

multimodal-pdf: missing system dependencies

alexcg1 opened this issue · 14 comments

On a clean AWS EC2 instance, the multimodal PDF search example fails out of the box because of missing system libraries.

Proposed solution

Update readme to tell users to install:

  • libcairo2
  • libpango-1.0-0
  • libpangocairo-1.0-0
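A quick way to check whether these libraries are visible to the dynamic loader before running the example is sketched below. It relies only on Python's standard `ctypes.util.find_library`. The short names `cairo`, `pango-1.0`, and `pangocairo-1.0` are an assumption about how the packages above map to shared-library names, and the helper `missing_libraries` is hypothetical, not part of the example repo.

```python
from ctypes.util import find_library

# Assumed mapping from shared-library short names to the Ubuntu packages listed above.
REQUIRED_LIBS = {
    "cairo": "libcairo2",
    "pango-1.0": "libpango-1.0-0",
    "pangocairo-1.0": "libpangocairo-1.0-0",
}


def missing_libraries(finder=find_library):
    """Return the Ubuntu package names whose shared library cannot be located."""
    return [pkg for lib, pkg in REQUIRED_LIBS.items() if finder(lib) is None]


if __name__ == "__main__":
    missing = missing_libraries()
    if missing:
        # Suggest the install command for whatever was not found.
        print("Missing system libraries; try: sudo apt-get install " + " ".join(missing))
    else:
        print("All required system libraries were found.")
```

The `finder` parameter exists only so the check can be exercised without touching the real system; on an actual machine the default is enough.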

Environment

lsb_release -a output:

Distributor ID: Ubuntu
Description:    Ubuntu 20.04.1 LTS
Release:        20.04
Codename:       focal

The CI passed with a fresh installation on Ubuntu 20.04.2 LTS. @alexcg1 Would you please give it another try with the latest code?

I've updated the tests and the code in #593.

I'm testing on AWS now and getting some strangeness: indexing starts fine, memory and CPU spike, then it just chills out, using no CPU and idling after indexing one doc. Then it spikes again for another doc and chills out. I'm now waiting for it to get back to work.
Screenshot from 2021-05-17 12-39-54
Screenshot from 2021-05-17 12-39-15
Screenshot from 2021-05-17 12-37-52

So @alexcg1 still not working, @nan-wang any ideas?

It's a big 🤷 from me I'm afraid

@alexcg1 Could you share the full logs and the output of jina -vf? It is difficult to reproduce this issue on my side.

From the screenshots you provided, it seems that your program hangs at the 2nd request. I guess it might be related to memory. Would you please try it on another instance with more memory, or set export JINA_PARALLEL=1?

@bwanglzu has noticed some issues with the request size already, see here.

@nan-wang I ran it on a fresh venv in a fresh clone just now. Same results

python --version: Python 3.8.5

Running on stock AWS ec2 with Ubuntu 20.04.1 LTS

From the screenshots you provided, it seems that your program hangs at the 2nd request. I guess it might be related to memory. Would you please try it on another instance with more memory, or set export JINA_PARALLEL=1?

@alexcg1 Have you tried this, i.e. setting export JINA_PARALLEL=1?

I noticed that the instance has only 3.8 GB of memory, whereas running python app.py -t index takes more than 6 GB. Setting export JINA_PARALLEL=1 reduced the memory usage to 5 GB.
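For anyone scripting this, a minimal sketch of applying the workaround automatically on small machines might look like the following. The 6 GB threshold is the rough figure reported in this thread, not a guarantee. The helpers `total_memory_gb` and `index_env` are hypothetical names, and the memory lookup assumes a Linux host where `os.sysconf` exposes `SC_PHYS_PAGES`.

```python
import os

# Approximate memory used by `python app.py -t index` with default settings,
# as reported in this thread; treat it as a rough guide only.
INDEX_MEMORY_GB = 6.0


def total_memory_gb():
    """Total physical memory in GB (assumes Linux, via os.sysconf)."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 1024 ** 3


def index_env(total_gb, base_env=None):
    """Return a copy of the environment, adding JINA_PARALLEL=1 on small machines."""
    env = dict(os.environ if base_env is None else base_env)
    if total_gb < INDEX_MEMORY_GB:
        env["JINA_PARALLEL"] = "1"
    return env


if __name__ == "__main__":
    env = index_env(total_memory_gb())
    print("JINA_PARALLEL =", env.get("JINA_PARALLEL", "(unchanged)"))
```

The returned dictionary can be passed as `env=` to `subprocess.run(["python", "app.py", "-t", "index"], env=env)`, which is equivalent to running export JINA_PARALLEL=1 in the shell first. An existing JINA_PARALLEL value is left alone when the machine has enough memory.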

Nice. After doing that, python app.py -t index worked perfectly on EC2.

And querying seems to work too