KeyError: 'style'
RaphSte opened this issue · 21 comments
When I try to run a PDF file through it, I get KeyError: 'style' with the following stack trace:
error uploading file, stacktrace: Traceback (most recent call last):
File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 48, in parse_document
return_dict, _ = ingestor_api.ingest_document(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
ingestor = pdf_ingestor.PDFIngestor(doc_location, parse_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 176, in parse_blocks
parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
self.parse(pages)
File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 198, in parse
p["style"], p.text, page_width
~^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/bs4/element.py", line 1573, in __getitem__
return self.attrs[key]
~~~~~~~~~~^^^^^
KeyError: 'style'
Steps to reproduce:
(tested on linux server)
- docker pull ghcr.io/nlmatics/nlm-ingestor:latest
- docker run -p 5010:5001 ghcr.io/nlmatics/nlm-ingestor
- After that, from a client:
from llmsherpa.readers import LayoutPDFReader
llmsherpa_api_url = "https://my-url/api/parseDocument?renderFormat=all"
# Both methods, local and online, produce the same error
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"
pdf_url = "./arxiv.org/pdf/1910.13461.pdf"
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)
This seems to be the same error as in #24.
One comment suggests that the Tika server is not running. How can I verify that?
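One way to check is to ask the Tika server for its version. The sketch below assumes Tika listens on its default port 9998 and answers `GET /version`, as a stock Apache Tika server does. Whether the modified jar in this image behaves the same is an assumption, and since the `docker run` command above only publishes port 5001, the check would have to run inside the container. The function name `tika_is_up` is made up for illustration.

```python
import urllib.error
import urllib.request


def tika_is_up(base_url="http://localhost:9998", timeout=3.0):
    """Return True if something answers GET /version with HTTP 200 at base_url.

    Assumption: a stock Tika server exposes /version on port 9998.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/version", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("Tika reachable:", tika_is_up())
```

A `False` result only says that nothing answered at that address; it does not by itself explain the KeyError.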
v0.1.8 and v0.1.7 have this problem for me. v0.1.6 works fine.
Same issue here, latest version has an issue.
Hey @RaphSte, did the latest version work for you? If so, can you tell me how?
Hey @shshnk158, no, it didn't work for me. I'll use v0.1.6 for now.
Yes, v0.1.6 is working fine, but it comes with tika-server-standard-nlm-modified-2.4.1_v6.jar. I wanted to try the latest jar file, [tika-server-standard-nlm-modified-2.9.2_v1.jar](https://github.com/nlmatics/nlm-ingestor/blob/main/jars/tika-server-standard-nlm-modified-2.9.2_v1.jar).
Any suggestions, @ansukla?
Yes, I'm facing this issue on v0.1.7 and v0.1.8.
The issue occurs because paragraphs are missing metadata. PR #70 solves this issue.
Until it is merged, you can use it locally with git fetch origin pull/70/head:PR70 and git switch PR70.
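To illustrate the failure itself: the traceback ends in bs4's `__getitem__`, which does `return self.attrs[key]`, so `p["style"]` raises KeyError for any paragraph tag that has no style attribute. The sketch below uses a small stand-in class, `FakeTag`, rather than bs4 itself, so it runs without extra dependencies. `FakeTag` and `style_of` are hypothetical names, and this is not necessarily how PR #70 handles the case.

```python
class FakeTag:
    """Hypothetical stand-in for a bs4 Tag: item access reads the attrs dict."""

    def __init__(self, text, attrs=None):
        self.text = text
        self.attrs = dict(attrs or {})

    def __getitem__(self, key):
        # Mirrors the `return self.attrs[key]` line shown in the traceback.
        return self.attrs[key]

    def get(self, key, default=None):
        # bs4 tags offer a dict-style get() as well.
        return self.attrs.get(key, default)


def style_of(p, default=""):
    """Return the tag's style attribute, or a default when it has none."""
    return p.get("style", default)


styled = FakeTag("Title", {"style": "font-size:12px"})
bare = FakeTag("Paragraph without metadata")

print(style_of(styled))
print(repr(style_of(bare)))

try:
    bare["style"]  # same access pattern as p["style"] in visual_ingestor.py
except KeyError as exc:
    print("KeyError:", exc)
```

Falling back to an empty style avoids the crash, but whether an empty style is a sensible input for the downstream parsing is a separate question.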
I'm facing the same issue, trying now to build the container with PR70.
Container build fails with "Failed to build pandas"
@vitorhirota Changing that to >= fixes the pandas error.
Now I'm hitting a problem with python -m nltk.downloader punkt.
➜ nlm-ingestor git:(PR70) ✗ docker build --platform=linux/x86_64 -t ohalo-nlm-ingestor .
[+] Building 2.6s (22/24) docker:desktop-linux
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 1.54kB 0.0s
=> resolve image config for docker.io/docker/dockerfile:experimental 0.4s
=> CACHED docker-image://docker.io/docker/dockerfile:experimental@sha256:600e5c62eedff338b3f7a0850beb7c05866e0ef27b2d2e8c02aa468e78496ff5 0.0s
=> [internal] load build definition from Dockerfile 0.0s
=> [internal] load .dockerignore 0.0s
=> [internal] load metadata for docker.io/library/python:3.11-bookworm 0.4s
=> [ 1/16] FROM docker.io/library/python:3.11-bookworm@sha256:4eee56938c2f48480acb90db616162cfa361f5987dd43e1371e5288ed3e5e95e 0.0s
=> => resolve docker.io/library/python:3.11-bookworm@sha256:4eee56938c2f48480acb90db616162cfa361f5987dd43e1371e5288ed3e5e95e 0.0s
=> [internal] load build context 0.1s
=> => transferring context: 186.95kB 0.1s
=> CACHED [ 2/16] RUN apt-get update && apt-get -y --no-install-recommends install libgomp1 0.0s
=> CACHED [ 3/16] RUN mkdir -p /usr/share/man/man1 && apt-get update -y && apt-get install -y openjdk-17-jre-headless 0.0s
=> CACHED [ 4/16] RUN apt-get install -y libxml2-dev libxslt-dev build-essential libmagic-dev 0.0s
=> CACHED [ 5/16] RUN apt-get install -y tesseract-ocr lsb-release && echo "deb https://notesalexp.org/tesseract-ocr5/$(lsb_release -cs)/ $(lsb_release -cs) main" | tee /e 0.0s
=> CACHED [ 6/16] RUN apt-get install unzip -y && apt-get install git -y && apt-get autoremove -y 0.0s
=> CACHED [ 7/16] WORKDIR /app 0.0s
=> CACHED [ 8/16] COPY . ./ 0.0s
=> CACHED [ 9/16] RUN pip install --upgrade pip setuptools 0.0s
=> CACHED [10/16] RUN apt-get install -y libmagic1 0.0s
=> CACHED [11/16] RUN mkdir -p -m 0600 ~/.ssh && ssh-keyscan github.com >> ~/.ssh/known_hosts 0.0s
=> CACHED [12/16] RUN pip install -r requirements.txt 0.0s
=> CACHED [13/16] RUN python -m nltk.downloader stopwords 0.0s
=> ERROR [14/16] RUN python -m nltk.downloader punkt 1.6s
------
> [14/16] RUN python -m nltk.downloader punkt:
0.505 <frozen runpy>:128: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour
0.874 [nltk_data] Downloading package punkt to /root/nltk_data...
1.525 [nltk_data] Unzipping tokenizers/punkt.zip.
1.526 [nltk_data] Error with downloaded zip file
1.526 Error installing package. Retry? [n/y/e]
1.529 Traceback (most recent call last):
1.529 File "<frozen runpy>", line 198, in _run_module_as_main
1.530 File "<frozen runpy>", line 88, in _run_code
1.530 File "/usr/local/lib/python3.11/site-packages/nltk/downloader.py", line 2537, in <module>
1.532 rv = downloader.download(
1.532 ^^^^^^^^^^^^^^^^^^^^
1.532 File "/usr/local/lib/python3.11/site-packages/nltk/downloader.py", line 790, in download
1.533 choice = input().strip()
1.533 ^^^^^^^
1.534 EOFError: EOF when reading a line
------
Dockerfile:34
--------------------
32 | RUN pip install -r requirements.txt
33 | RUN python -m nltk.downloader stopwords
34 | >>> RUN python -m nltk.downloader punkt
35 | RUN python -c "import tiktoken; tiktoken.get_encoding(\"cl100k_base\")"
36 | RUN chmod +x run.sh
--------------------
ERROR: failed to solve: process "/bin/sh -c python -m nltk.downloader punkt" did not complete successfully: exit code: 1
EDIT: Nevermind, turning on my VPN resolved this issue. I really need to switch ISPs... :)
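For anyone hitting the same log: the command-line downloader asks "Retry? [n/y/e]" after a bad download, and since docker build has no stdin, the `input()` call fails with EOFError. A rough alternative is to call `nltk.download` from Python, which as far as I know returns True or False instead of prompting, and retry a few times. The helper names `download_with_retries` and `_nltk_download` are made up for this sketch, and it won't help if the network blocks the download entirely.

```python
import time


def _nltk_download(package):
    # Imported lazily so this module loads even where nltk is not installed.
    import nltk

    return bool(nltk.download(package, quiet=True))


def download_with_retries(package, attempts=3, delay=2.0, download=_nltk_download):
    """Try to fetch an NLTK package a few times; return True on success."""
    for attempt in range(1, attempts + 1):
        try:
            if download(package):
                return True
        except OSError:
            # Treat network-level failures like a failed attempt.
            pass
        if attempt < attempts:
            time.sleep(delay)
    return False
```

In a Dockerfile this could replace the `python -m nltk.downloader punkt` step with a small script that exits non-zero when `download_with_retries("punkt")` returns False.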
I can confirm that building the Docker image from PR #70 with pandas>=1.2.4 works, and the container does not show the KeyError. Thanks!
Sorry, this one's on me. The first PR was a massive code refresh on top of the latest Tika, and I missed some key elements that my tests didn't cover. The second PR with jar v2 should resolve it, but I'm waiting on @ansukla or someone else to merge it here.
Here's the docker image I'm using with everything baked in:
jamesmtc/nlm-ingestor
v0.1.8 and v0.1.7
Are you talking about the nlm-ingestor version? I can't see v0.1.6 there.
@ddose-inferyx Yes, this is about the nlm-ingestor version. You can either pull the image directly (see ) or build it yourself by selecting the tag.
I didn't try building it myself, though. I just pulled the image directly and it worked for me.
v0.1.8 and v0.1.7 have this problem for me. v0.1.6 works fine.
Thanks, @RaphSte! It started working for me once I used http://localhost:5010/api/parseDocument?renderFormat=all&applyOcr=yes&useNewIndentParser=yes. Using useNewIndentParser=yes will also work with the latest version.
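A small sketch of building that URL in Python instead of typing the query string by hand. The parameter names and the localhost:5010 address come from the URL above; the helper name `build_parse_url` is made up for illustration.

```python
from urllib.parse import urlencode


def build_parse_url(base="http://localhost:5010", **options):
    """Build a parseDocument URL from a base address and query options."""
    query = urlencode(options)
    url = f"{base}/api/parseDocument"
    return f"{url}?{query}" if query else url


api_url = build_parse_url(
    renderFormat="all",
    applyOcr="yes",
    useNewIndentParser="yes",
)
print(api_url)
```

The resulting string can be passed to LayoutPDFReader in place of the llmsherpa_api_url used in the reproduction steps.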
Here's the docker image I'm using with everything baked in:
jamesmtc/nlm-ingestor
docker pull ghcr.io/jamesmtc/nlm-ingestor
Error response from daemon: Head "https://ghcr.io/v2/jamesmtc/nlm-ingestor/manifests/latest": denied
I was getting the same error and just needed to reset my authentication info for ghcr. I removed any preset ghcr configs, then followed the setup instructions here. After that, running docker pull jamesmtc/nlm-ingestor:latest worked fine.
Merging changes from @jamesvillarrubia. Apologies for the delay. Thanks James for putting together the fix. Feel free to send me a note on LinkedIn if something needs attention.