chezou/tabula-py

Superscript numbers in PDF are coerced to normal numbers

drewbeh opened this issue · 2 comments

Superscript numbers show up concatenated as normal numbers

I am attempting to extract some data that contains superscripts. Image of the number in question: https://i.stack.imgur.com/tdXKR.png

  • Did you read FAQ?

  • (Optional, but really helpful) Your PDF URL: https://edisciplinas.usp.br/pluginfile.php/4557662/mod_resource/content/1/CRC%20Handbook%20of%20Chemistry%20and%20Physics%2095th%20Edition.pdf (the table in question is on page 147)

  • Paste the output of import tabula; tabula.environment_info() on Python REPL:
    Python version:
    3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53)
    [GCC 9.4.0]
    Java version:
    openjdk version "11.0.16" 2022-07-19
    OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu120.04)
    OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Ubuntu-0ubuntu120.04, mixed mode, sharing)
    tabula-py version: 2.5.1
    platform: Linux-5.15.65+-x86_64-with-debian-bullseye-sid
    uname:
    uname_result(system='Linux', node='2e4bec642b2a', release='5.15.65+', version='#1 SMP Sat Oct 22 09:37:52 UTC 2022', machine='x86_64', processor='x86_64')
    linux_distribution: ('Ubuntu', '20.04', 'focal')
    mac_ver: ('', ('', '', ''), '')

If not possible to execute tabula.environment_info(), please answer following questions manually.

  • Paste the output of python --version command on your terminal: 3.7.12
  • Paste the output of java -version command on your terminal: 11.0.16
  • Does java -h command work well?; Ensure your java command is included in PATH
  • Write your OS and its version: OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu120.04)
    Kaggle kernel

tabula-py version: 2.5.1

What did you do when you faced the problem?

No workaround found so far. The superscript digits are extracted as ordinary digits, so a regex that targets superscript characters has nothing to match.

Code:

import tabula

# pdf_path is a placeholder for a local copy (or the URL) of the PDF linked above
pdf_path = "handbook.pdf"
tabula.read_pdf(pdf_path, pages=147, multiple_tables=False, stream=True, guess=False,
                area=(54.2, 53.8, 794.3, 615.0),
                columns=(70.1, 152.9, 236.8, 287.7, 324.9, 351.9, 387.0, 423.2, 456.8, 487.9, 514.3, 559.9))
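
One possible direction for a workaround, sketched under assumptions and not tested against this PDF: if per-character font sizes can be obtained from another tool (pdfplumber, for example, exposes `page.chars` as dicts with `text` and `size` keys), superscripts could be separated from body digits because they are set in a smaller font. The helper name `split_superscripts`, the `ratio` threshold, and the sample sizes below are all hypothetical and are not part of tabula-py.

```python
from statistics import median


def split_superscripts(chars, ratio=0.8):
    """Split one cell's characters into (base_text, superscript_text).

    `chars` is a list of dicts with "text" and "size" keys, in reading order.
    A character is treated as superscript when its font size is below
    `ratio` times the median size in the cell (a heuristic, not a PDF rule).
    """
    if not chars:
        return "", ""
    body_size = median(c["size"] for c in chars)
    base, sup = [], []
    for c in chars:
        target = sup if c["size"] < ratio * body_size else base
        target.append(c["text"])
    return "".join(base), "".join(sup)


# Made-up font sizes for the cell that tabula-py returns as "2509".
cell = [
    {"text": "2", "size": 8.0},
    {"text": "5", "size": 8.0},
    {"text": "0", "size": 8.0},
    {"text": "9", "size": 5.0},
]
base, sup = split_superscripts(cell)
print(base, sup)
```

The median is used so that a single small character does not drag the reference size down. A cell in which superscript characters outnumber body characters would need a different reference, such as a page-wide median.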

Expected behavior:

Either 250^9 (superscript preserved), or the superscript dropped so that the value reads 250.

Actual behavior:

2509
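
To illustrate why a regex search does not help here (a minimal sketch; `SUPERSCRIPT_DIGITS` is a name introduced only for this illustration, and the string is the value shown above): the extracted text contains only ASCII digits, so a pattern for Unicode superscript digits finds nothing, and nothing in the string marks where the base number ends.

```python
import re

# Unicode superscript digits: U+2070, U+00B9, U+00B2, U+00B3, U+2074..U+2079
SUPERSCRIPT_DIGITS = re.compile("[\u2070\u00b9\u00b2\u00b3\u2074-\u2079]")

extracted = "2509"  # value returned for the cell in question

# Every character is a plain ASCII digit, so the superscript pattern never matches.
has_superscript = SUPERSCRIPT_DIGITS.search(extracted) is not None
print(has_superscript)

# For comparison, a string that really contains a superscript nine does match.
print(SUPERSCRIPT_DIGITS.search("250\u2079") is not None)
```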

Related Issues:

@drewbeh this issue was automatically closed because it did not follow the issue template

I am following the template guidelines but cannot seem to keep this issue from being auto closed.