You're parsing an XML document using an HTML parser
lucazav opened this issue · 1 comment
I'm running the demo code, referencing a specific grobid url:
import scipdf
article_dict = scipdf.parse_pdf_to_dict('examples/futoma2017improved.pdf',
grobid_url='https://<my-grobid-url>/')
I'm getting the following warning:
/anaconda/envs/scipdfparser/lib/python3.9/site-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument features="xml" into the BeautifulSoup constructor.
I'm running scipdf in a conda environment with Python 3.9.16. Here are the installed packages:
# packages in environment at /anaconda/envs/scipdfparser:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
asttokens 2.0.5 pyhd3eb1b0_0
backcall 0.2.0 pyhd3eb1b0_0
beautifulsoup4 4.11.2 pypi_0 pypi
blas 1.0 mkl
blis 0.7.9 pypi_0 pypi
ca-certificates 2023.01.10 h06a4308_0
catalogue 2.0.8 pypi_0 pypi
certifi 2022.12.7 py39h06a4308_0
charset-normalizer 3.0.1 pypi_0 pypi
click 8.1.3 pypi_0 pypi
comm 0.1.2 py39h06a4308_0
confection 0.0.4 pypi_0 pypi
cymem 2.0.7 pypi_0 pypi
debugpy 1.5.1 py39h295c915_0
decorator 5.1.1 pyhd3eb1b0_0
en-core-web-sm 3.5.0 pypi_0 pypi
entrypoints 0.4 py39h06a4308_0
executing 0.8.3 pyhd3eb1b0_0
idna 3.4 pypi_0 pypi
intel-openmp 2021.4.0 h06a4308_3561
ipykernel 6.19.2 py39hb070fc8_0
ipython 8.8.0 py39h06a4308_0
jedi 0.18.1 py39h06a4308_1
jinja2 3.1.2 pypi_0 pypi
jupyter_client 7.4.8 py39h06a4308_0
jupyter_core 5.1.1 py39h06a4308_0
langcodes 3.3.0 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1
libffi 3.4.2 h6a678d5_6
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libsodium 1.0.18 h7b6447c_0
libstdcxx-ng 11.2.0 h1234567_1
lxml 4.9.2 pypi_0 pypi
markupsafe 2.1.2 pypi_0 pypi
matplotlib-inline 0.1.6 py39h06a4308_0
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py39h7f8727e_0
mkl_fft 1.3.1 py39hd3c417c_0
mkl_random 1.2.2 py39h51133e4_0
murmurhash 1.0.9 pypi_0 pypi
ncurses 6.4 h6a678d5_0
nest-asyncio 1.5.6 py39h06a4308_0
numpy 1.23.5 py39h14f4228_0
numpy-base 1.23.5 py39h31eccc5_0
openssl 1.1.1s h7f8727e_0
packaging 22.0 py39h06a4308_0
pandas 1.5.3 pypi_0 pypi
parso 0.8.3 pyhd3eb1b0_0
pathy 0.10.1 pypi_0 pypi
pexpect 4.8.0 pyhd3eb1b0_3
pickleshare 0.7.5 pyhd3eb1b0_1003
pip 22.3.1 py39h06a4308_0
platformdirs 2.5.2 py39h06a4308_0
preshed 3.0.8 pypi_0 pypi
prompt-toolkit 3.0.36 py39h06a4308_0
psutil 5.9.0 py39h5eee18b_0
ptyprocess 0.7.0 pyhd3eb1b0_2
pure_eval 0.2.2 pyhd3eb1b0_0
pydantic 1.10.4 pypi_0 pypi
pygments 2.11.2 pyhd3eb1b0_0
pyphen 0.13.2 pypi_0 pypi
python 3.9.16 h7a1cb2a_0
python-dateutil 2.8.2 pyhd3eb1b0_0
pytz 2022.7.1 pypi_0 pypi
pyzmq 23.2.0 py39h6a678d5_0
readline 8.2 h5eee18b_0
requests 2.28.2 pypi_0 pypi
scipdf 0.1.dev0 pypi_0 pypi
setuptools 65.6.3 py39h06a4308_0
six 1.16.0 pyhd3eb1b0_1
smart-open 6.3.0 pypi_0 pypi
soupsieve 2.3.2.post1 pypi_0 pypi
spacy 3.5.0 pypi_0 pypi
spacy-legacy 3.0.12 pypi_0 pypi
spacy-loggers 1.0.4 pypi_0 pypi
sqlite 3.40.1 h5082296_0
srsly 2.4.5 pypi_0 pypi
stack_data 0.2.0 pyhd3eb1b0_0
textstat 0.7.3 pypi_0 pypi
thinc 8.1.7 pypi_0 pypi
tk 8.6.12 h1ccaba5_0
tornado 6.2 py39h5eee18b_0
tqdm 4.64.1 pypi_0 pypi
traitlets 5.7.1 py39h06a4308_0
typer 0.7.0 pypi_0 pypi
typing-extensions 4.4.0 pypi_0 pypi
tzdata 2022g h04d1e81_0
urllib3 1.26.14 pypi_0 pypi
wasabi 1.1.1 pypi_0 pypi
wcwidth 0.2.5 pyhd3eb1b0_0
wheel 0.37.1 pyhd3eb1b0_0
xz 5.2.10 h5eee18b_1
zeromq 4.3.4 h2531618_0
zlib 1.2.13 h5eee18b_0
I have the same problem. I think it is caused by the updated bs4. It is harmless, and I expect the warning will not affect the parsing. I tried changing the features to xml, but that throws errors. You can filter the warning like this:
import warnings
from bs4.builder import XMLParsedAsHTMLWarning
warnings.filterwarnings('ignore', category=XMLParsedAsHTMLWarning)
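The filter above stays active for the rest of the process. If you would rather silence the warning only around the scipdf call, here is a minimal sketch using the standard library's `warnings.catch_warnings`. The helper name `call_without_xml_warning` is hypothetical (it is not part of scipdf or bs4), and the fallback class is only a stand-in so the snippet still runs where bs4 is not installed.

```python
import warnings

try:
    # Available in recent bs4 releases (the warning in this issue comes from 4.11.2).
    from bs4 import XMLParsedAsHTMLWarning
except ImportError:
    # Assumption: stand-in class so the sketch runs without bs4 installed.
    class XMLParsedAsHTMLWarning(UserWarning):
        pass


def call_without_xml_warning(func, *args, **kwargs):
    """Call func, ignoring XMLParsedAsHTMLWarning only for the duration of the call.

    Hypothetical helper for illustration; not part of scipdf or bs4.
    """
    with warnings.catch_warnings():
        # The filter is dropped again when the with-block exits.
        warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)
        return func(*args, **kwargs)


# Usage with the call from this issue (left commented out because it needs
# scipdf and a reachable GROBID server):
# import scipdf
# article_dict = call_without_xml_warning(
#     scipdf.parse_pdf_to_dict,
#     'examples/futoma2017improved.pdf',
#     grobid_url='https://<my-grobid-url>/',
# )
```

Other warnings, and any XMLParsedAsHTMLWarning raised outside the wrapped call, are still reported as usual.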