MicheleCotrufo/pdf2doi

Clash with other pdf extraction libraries

cmartinotti opened this issue · 1 comments

I use several other PDF extraction tools, such as tabula, camelot and layoutparser, and it seems that pdf2doi uses an older version of pdfminer-six, which causes problems when it coexists with these libraries. When I install it with pip in the same environment where I use layoutparser and camelot, I get this error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pdfplumber 0.6.0 requires pdfminer.six==20211012, but you have pdfminer-six 20181108 which is incompatible.
google-api-core 1.31.5 requires six>=1.13.0, but you have six 1.12.0 which is incompatible.
camelot-py 0.10.1 requires pdfminer.six>=20200726, but you have pdfminer-six 20181108 which is incompatible.

Is there a workaround to this problem?

Yes. When I first put pdf2doi together, some of the libraries I used (specifically textract and pdf-title) required different versions of six and pdfminer. To solve the problem, I made sure that pdf2doi only used textract==1.6.3, six==1.12.0 and pdfminer==20181108.

I checked again now, and it seems that this incompatibility does not exist anymore. I released another rc version of pdf2doi, where I do not set any explicit constraint on the version of six and pdfminer, and I use textract==1.6.4.

pip install pdf2doi==1.1rc2

Note that you might still get an error from pip because, for example, textract==1.6.4 requires six==1.12.0, while a library you need (google-api-core) seems to require six>=1.13.0.
This is not necessarily a problem: textract also works with six>=1.13.0, although pip will still report the conflict.
Just make sure that, after installing pdf2doi, you also upgrade six and pdfminer to the versions you need.
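
To confirm which versions actually ended up in the environment after the upgrade, a small check along these lines may help. It is only a sketch: the helper name `installed_versions` is made up for illustration, and the package list is simply the set of names mentioned in this thread (adjust it to your environment). It relies on the standard-library `importlib.metadata`, so it assumes Python 3.8 or newer.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent.

    Hypothetical helper for illustration; not part of pdf2doi.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Package names taken from the pip error message and the reply above.
    names = ["pdf2doi", "textract", "six", "pdfminer.six"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver or 'not installed'}")
```

Comparing the printed versions against the requirements listed in the pip error (for example six>=1.13.0 for google-api-core) shows whether the upgrade took effect.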

Let me know if this new version works well and if you find any bugs. Thanks!