nlmatics/nlm-ingestor

KeyError: 'style'

RaphSte opened this issue · 21 comments

When I try to run a PDF file through it, I get KeyError: 'style' with the following stack trace:

error uploading file, stacktrace: Traceback (most recent call last):
  File "/app/nlm_ingestor/ingestion_daemon/__main__.py", line 48, in parse_document
    return_dict, _ = ingestor_api.ingest_document(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/ingestor_api.py", line 37, in ingest_document
    ingestor = pdf_ingestor.PDFIngestor(doc_location, parse_options)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 35, in __init__
    blocks, _block_texts, _sents, _file_data, result, page_dim, num_pages = parse_blocks(
                                                                            ^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/pdf_ingestor.py", line 176, in parse_blocks
    parsed_doc = visual_ingestor.Doc(pages, ignore_blocks, render_format)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 117, in __init__
    self.parse(pages)
  File "/app/nlm_ingestor/ingestor/visual_ingestor/visual_ingestor.py", line 198, in parse
    p["style"], p.text, page_width
    ~^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bs4/element.py", line 1573, in __getitem__
    return self.attrs[key]
           ~~~~~~~~~~^^^^^
KeyError: 'style'

Steps to reproduce:

(tested on linux server)

  • docker pull ghcr.io/nlmatics/nlm-ingestor:latest
  • docker run -p 5010:5001 ghcr.io/nlmatics/nlm-ingestor
  • After that, from a client:
from llmsherpa.readers import LayoutPDFReader
llmsherpa_api_url = "https://my-url/api/parseDocument?renderFormat=all"

# Both methods, remote URL and local file, produce the same error
pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"
# pdf_url = "./arxiv.org/pdf/1910.13461.pdf"


pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

This seems to be the same error as in #24.
One comment there suggests that the Tika server is not running. How can I verify that?
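
One way to check, sketched under assumptions: a stock Tika server listens on port 9998 by default, so a plain TCP connect from inside the container shows whether anything is listening there. The `is_port_open` helper, the host, and the port below are illustrative; whether this image keeps Tika's default port is an assumption, not something confirmed in this thread.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumption: Tika runs on its default port 9998 inside the container.
    print("Tika port reachable:", is_port_open("localhost", 9998))
```

This only shows that something accepts connections on that port, not that Tika itself is healthy, and it has to run where the port is visible (for example inside the container via docker exec).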

v0.1.8 and v0.1.7 have this problem for me. v0.1.6 works fine.

Hey @RaphSte, did the latest version work for you? If so, can you tell me how?

Same issue here, latest version has an issue.

Hey @shshnk158, no, it didn't work for me. I'll use v0.1.6 for now.

Yes, v0.1.6 is working fine, but it comes with tika-server-standard-nlm-modified-2.4.1_v6.jar. I wanted to try the latest jar file, [tika-server-standard-nlm-modified-2.9.2_v1.jar](https://github.com/nlmatics/nlm-ingestor/blob/main/jars/tika-server-standard-nlm-modified-2.9.2_v1.jar). Any suggestions, @ansukla?

Yes, I'm facing this issue on v0.1.7 and v0.1.8.

The issue is because paragraphs are missing metadata. PR #70 solves this issue.

Until it is merged, you can use it locally with git fetch origin pull/70/head:PR70 and git switch PR70.
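
To see whether a given parser output is affected, a minimal sketch like the following counts `<p>` tags that lack a `style` attribute, which is the attribute the traceback shows being read with `p["style"]`. It uses only the standard library; the `StyleAudit` class, the `audit_paragraph_styles` helper, and the sample markup are hypothetical and not taken from nlm-ingestor or from real Tika output.

```python
from html.parser import HTMLParser


class StyleAudit(HTMLParser):
    """Count <p> start tags with and without a style attribute."""

    def __init__(self):
        super().__init__()
        self.with_style = 0
        self.without_style = 0

    def handle_starttag(self, tag, attrs):
        if tag != "p":
            return
        # attrs is a list of (name, value) pairs for this tag.
        if "style" in dict(attrs):
            self.with_style += 1
        else:
            self.without_style += 1


def audit_paragraph_styles(html: str) -> tuple[int, int]:
    """Return (paragraphs with style, paragraphs without style)."""
    parser = StyleAudit()
    parser.feed(html)
    parser.close()
    return parser.with_style, parser.without_style


# Hypothetical sample markup, for illustration only.
sample = '<div class="page"><p style="font-size:10px">Title</p><p>No metadata</p></div>'
print(audit_paragraph_styles(sample))  # → (1, 1)
```

In Beautiful Soup itself, `tag["style"]` raises KeyError when the attribute is absent, while `tag.get("style")` returns None, so any paragraph counted in the second number would trigger the error above.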

I'm facing the same issue, trying now to build the container with PR70.

Container build fails with "Failed to build pandas"

@rednag PR #73 is related, but in my case I just updated requirements.txt to have pandas>=1.2.4.

pandas==1.2.4

@vitorhirota Changing that to >= fixes the pandas error.
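
For anyone scripting that change, here is a small sketch of relaxing an exact pin to a lower bound in the text of a requirements file. The `relax_pin` helper and the sample contents are hypothetical; editing requirements.txt by hand works just as well.

```python
import re


def relax_pin(requirements: str, package: str) -> str:
    """Turn 'package==X' into 'package>=X', leaving other lines untouched."""
    # Match the package name at the start of a line, followed by '==' but
    # not '===' (pip's arbitrary-equality operator).
    pattern = re.compile(
        rf"^({re.escape(package)})\s*==(?!=)",
        flags=re.MULTILINE | re.IGNORECASE,
    )
    return pattern.sub(r"\1>=", requirements)


# Hypothetical file contents; only the pandas line comes from this thread.
text = "some-other-package==1.0\npandas==1.2.4\n"
print(relax_pin(text, "pandas"))
```

Relaxing the pin lets pip pick a newer pandas release that has prebuilt wheels for the Python version in the image, rather than compiling the old one from source.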

Now I'm hitting a problem with python -m nltk.downloader punkt.

➜  nlm-ingestor git:(PR70) ✗ docker build --platform=linux/x86_64 -t ohalo-nlm-ingestor .
[+] Building 2.6s (22/24)                                                                                                                                           docker:desktop-linux
 => [internal] load .dockerignore                                                                                                                                                   0.0s
 => => transferring context: 2B                                                                                                                                                     0.0s
 => [internal] load build definition from Dockerfile                                                                                                                                0.0s
 => => transferring dockerfile: 1.54kB                                                                                                                                              0.0s
 => resolve image config for docker.io/docker/dockerfile:experimental                                                                                                               0.4s
 => CACHED docker-image://docker.io/docker/dockerfile:experimental@sha256:600e5c62eedff338b3f7a0850beb7c05866e0ef27b2d2e8c02aa468e78496ff5                                          0.0s
 => [internal] load build definition from Dockerfile                                                                                                                                0.0s
 => [internal] load .dockerignore                                                                                                                                                   0.0s
 => [internal] load metadata for docker.io/library/python:3.11-bookworm                                                                                                             0.4s
 => [ 1/16] FROM docker.io/library/python:3.11-bookworm@sha256:4eee56938c2f48480acb90db616162cfa361f5987dd43e1371e5288ed3e5e95e                                                     0.0s
 => => resolve docker.io/library/python:3.11-bookworm@sha256:4eee56938c2f48480acb90db616162cfa361f5987dd43e1371e5288ed3e5e95e                                                       0.0s
 => [internal] load build context                                                                                                                                                   0.1s
 => => transferring context: 186.95kB                                                                                                                                               0.1s
 => CACHED [ 2/16] RUN apt-get update && apt-get -y --no-install-recommends install libgomp1                                                                                        0.0s
 => CACHED [ 3/16] RUN mkdir -p /usr/share/man/man1 &&   apt-get update -y &&   apt-get install -y openjdk-17-jre-headless                                                          0.0s
 => CACHED [ 4/16] RUN apt-get install -y   libxml2-dev libxslt-dev   build-essential libmagic-dev                                                                                  0.0s
 => CACHED [ 5/16] RUN apt-get install -y   tesseract-ocr   lsb-release   && echo "deb https://notesalexp.org/tesseract-ocr5/$(lsb_release -cs)/ $(lsb_release -cs) main" | tee /e  0.0s
 => CACHED [ 6/16] RUN apt-get install unzip -y &&   apt-get install git -y &&   apt-get autoremove -y                                                                              0.0s
 => CACHED [ 7/16] WORKDIR /app                                                                                                                                                     0.0s
 => CACHED [ 8/16] COPY . ./                                                                                                                                                        0.0s
 => CACHED [ 9/16] RUN pip install --upgrade pip setuptools                                                                                                                         0.0s
 => CACHED [10/16] RUN apt-get install -y libmagic1                                                                                                                                 0.0s
 => CACHED [11/16] RUN mkdir -p -m 0600 ~/.ssh && ssh-keyscan github.com >> ~/.ssh/known_hosts                                                                                      0.0s
 => CACHED [12/16] RUN pip install -r requirements.txt                                                                                                                              0.0s
 => CACHED [13/16] RUN python -m nltk.downloader stopwords                                                                                                                          0.0s
 => ERROR [14/16] RUN python -m nltk.downloader punkt                                                                                                                               1.6s
------                                                                                                                                                                                   
 > [14/16] RUN python -m nltk.downloader punkt:                                                                                                                                          
0.505 <frozen runpy>:128: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour                                                                                                                                                                          
0.874 [nltk_data] Downloading package punkt to /root/nltk_data...
1.525 [nltk_data]   Unzipping tokenizers/punkt.zip.
1.526 [nltk_data] Error with downloaded zip file
1.526 Error installing package. Retry? [n/y/e]
1.529 Traceback (most recent call last):
1.529   File "<frozen runpy>", line 198, in _run_module_as_main
1.530   File "<frozen runpy>", line 88, in _run_code
1.530   File "/usr/local/lib/python3.11/site-packages/nltk/downloader.py", line 2537, in <module>
1.532     rv = downloader.download(
1.532          ^^^^^^^^^^^^^^^^^^^^
1.532   File "/usr/local/lib/python3.11/site-packages/nltk/downloader.py", line 790, in download
1.533     choice = input().strip()
1.533              ^^^^^^^
1.534 EOFError: EOF when reading a line
------
Dockerfile:34
--------------------
  32 |     RUN pip install -r requirements.txt
  33 |     RUN python -m nltk.downloader stopwords
  34 | >>> RUN python -m nltk.downloader punkt
  35 |     RUN python -c "import tiktoken; tiktoken.get_encoding(\"cl100k_base\")"
  36 |     RUN chmod +x run.sh
--------------------
ERROR: failed to solve: process "/bin/sh -c python -m nltk.downloader punkt" did not complete successfully: exit code: 1

EDIT: Never mind, turning on my VPN resolved this issue. I really need to switch ISPs... :)

I can confirm that building the Docker image from PR #70 with pandas>=1.2.4 works, and the container does not show the KeyError. Thanks!

Sorry, this one's on me. The first PR was a massive code refresh on top of the latest Tika, and I missed some key elements that my tests didn't cover. The second PR with jar v2 should resolve it, but it is waiting on @ansukla or someone else to merge it here.

Here's the docker image I'm using with everything baked in:
jamesmtc/nlm-ingestor

v0.1.8 and v0.1.7

Are you talking about the nlm-ingestor version? I can't see v0.1.6 there.

@ddose-inferyx Yes, this is about the nlm-ingestor version. You can either pull the image directly (see here) or build it yourself from the tag v0.1.6.
I didn't try building it myself, though. I just pulled the image directly and it worked for me.

Thanks, @RaphSte! It started working for me once I used http://localhost:5010/api/parseDocument?renderFormat=all&applyOcr=yes&useNewIndentParser=yes. Adding useNewIndentParser=yes also works with the latest version.
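
As a sketch of how that URL could be assembled on the client side, the query string can be built with the standard library instead of being typed by hand. The `build_parse_url` helper is hypothetical; the base URL and parameter names are the ones quoted in this thread.

```python
from urllib.parse import urlencode


def build_parse_url(base: str, **options: str) -> str:
    """Append the given options to base as a URL-encoded query string."""
    return f"{base}?{urlencode(options)}"


url = build_parse_url(
    "http://localhost:5010/api/parseDocument",
    renderFormat="all",
    applyOcr="yes",
    useNewIndentParser="yes",
)
print(url)
# → http://localhost:5010/api/parseDocument?renderFormat=all&applyOcr=yes&useNewIndentParser=yes
```

The resulting string can be passed to LayoutPDFReader in place of the llmsherpa_api_url shown in the original report.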

Here's the docker image I'm using with everything baked in: jamesmtc/nlm-ingestor

docker pull ghcr.io/jamesmtc/nlm-ingestor

Error response from daemon: Head "https://ghcr.io/v2/jamesmtc/nlm-ingestor/manifests/latest": denied

I was getting the same error, and I just needed to reset my authentication info for ghcr. I removed any preset ghcr configs, then followed the setup instructions here. After that, running docker pull jamesmtc/nlm-ingestor:latest worked fine.

Merging changes from @jamesvillarrubia. Apologies for the delay. Thanks James for putting together the fix. Feel free to send me a note on LinkedIn if something needs attention.