chezou/tabula-py

Superscript numbers in PDF are coerced into normal digits

drewbeh opened this issue · 1 comment

Summary of your issue

I am attempting to extract some data that contains superscripts. Image of the number in question: https://i.stack.imgur.com/tdXKR.png

  • Did you read FAQ?

  • (Optional, but really helpful) Your PDF URL: https://edisciplinas.usp.br/pluginfile.php/4557662/mod_resource/content/1/CRC%20Handbook%20of%20Chemistry%20and%20Physics%2095th%20Edition.pdf (the table in question is on page 147)

  • Paste the output of import tabula; tabula.environment_info() on Python REPL:
    Python version:
    3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53)
    [GCC 9.4.0]
    Java version:
    openjdk version "11.0.16" 2022-07-19
    OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu120.04)
    OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Ubuntu-0ubuntu120.04, mixed mode, sharing)
    tabula-py version: 2.5.1
    platform: Linux-5.15.65+-x86_64-with-debian-bullseye-sid
    uname:
    uname_result(system='Linux', node='2e4bec642b2a', release='5.15.65+', version='#1 SMP Sat Oct 22 09:37:52 UTC 2022', machine='x86_64', processor='x86_64')
    linux_distribution: ('Ubuntu', '20.04', 'focal')
    mac_ver: ('', ('', '', ''), '')

If it is not possible to execute tabula.environment_info(), please answer the following questions manually.

  • Paste the output of python --version command on your terminal: 3.7.12
  • Paste the output of java -version command on your terminal: 11.0.16
  • Does the java -h command work well? Ensure your java command is included in PATH
  • Write your OS and its version: OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu120.04)
    Kaggle kernel

tabula-py version: 2.5.1

What did you do when you faced the problem?

Code:

import tabula

# pdf_path: local path to the PDF linked above
tabula.read_pdf(pdf_path, pages=147, multiple_tables=False, stream=True, guess=False,
                area=(54.2, 53.8, 794.3, 615.0),
                columns=(70.1, 152.9, 236.8, 287.7, 324.9, 351.9, 387.0, 423.2, 456.8, 487.9, 514.3, 559.9))

Expected behavior:

Either 250^9 (superscript kept but marked as such) or 250 (superscript simply ignored).

Actual behavior:

2509
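
To make the expected behavior concrete, here is a minimal sketch of the separation I have in mind. I am not aware of a tabula-py option that exposes glyph size, so this assumes the characters of one cell are already available as records with `text` and `size` fields from some other extractor (pdfplumber's `page.chars` entries have that shape, for example). The function name `split_superscripts`, the `size_ratio` threshold, and the font sizes in the sample data are all hypothetical, and size alone is only a heuristic: a real superscript is also raised above the baseline.

```python
def split_superscripts(chars, size_ratio=0.8):
    """Split one cell's characters into (base_text, superscript_text).

    chars: list of dicts with "text" and "size" keys (assumed input shape).
    A character counts as superscript when its font size is below
    size_ratio times the largest font size found in the cell.
    """
    if not chars:
        return "", ""
    body_size = max(c["size"] for c in chars)
    threshold = body_size * size_ratio
    base = "".join(c["text"] for c in chars if c["size"] >= threshold)
    sup = "".join(c["text"] for c in chars if c["size"] < threshold)
    return base, sup


# Illustrative data only: the sizes are made up, not measured from the PDF.
cell = [
    {"text": "2", "size": 8.0},
    {"text": "5", "size": 8.0},
    {"text": "0", "size": 8.0},
    {"text": "9", "size": 5.2},
]
base, sup = split_superscripts(cell)
marked = f"{base}^{sup}" if sup else base
```

With this, `base` gives the "superscript ignored" variant and `marked` gives the "250^9" variant, so either expected output could be produced from the same per-character data.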

Related Issues:

@drewbeh this issue was automatically closed because it did not follow the issue template