Superscript numbers in PDF are coerced to normal numbers
drewbeh opened this issue · 1 comments
Summary of your issue
I am attempting to extract some data that contains superscripts. Image of the number in question: https://i.stack.imgur.com/tdXKR.png
Did you read FAQ?
(Optional, but really helpful) Your PDF URL: https://edisciplinas.usp.br/pluginfile.php/4557662/mod_resource/content/1/CRC%20Handbook%20of%20Chemistry%20and%20Physics%2095th%20Edition.pdf (the table in question is on page 147)
Paste the output of import tabula; tabula.environment_info() on Python REPL:
Python version:
3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53)
[GCC 9.4.0]
Java version:
openjdk version "11.0.16" 2022-07-19
OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu120.04)
OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Ubuntu-0ubuntu120.04, mixed mode, sharing)
tabula-py version: 2.5.1
platform: Linux-5.15.65+-x86_64-with-debian-bullseye-sid
uname:
uname_result(system='Linux', node='2e4bec642b2a', release='5.15.65+', version='#1 SMP Sat Oct 22 09:37:52 UTC 2022', machine='x86_64', processor='x86_64')
linux_distribution: ('Ubuntu', '20.04', 'focal')
mac_ver: ('', ('', '', ''), '')
If not possible to execute tabula.environment_info(), please answer the following questions manually.
- Paste the output of python --version command on your terminal: 3.7.12
- Paste the output of java -version command on your terminal: 11.0.16
- Does java -h command work well?; Ensure your java command is included in PATH
- Write your OS and its version: OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu120.04), Kaggle kernel
- tabula-py version: 2.5.1
What did you do when you faced the problem?
Code:
import tabula
# pdf_path: local path or URL of the PDF linked above
tabula.read_pdf(pdf_path, pages=147, multiple_tables=False, stream=True, guess=False,
                area=(54.2, 53.8, 794.3, 615.0),
                columns=(70.1, 152.9, 236.8, 287.7, 324.9, 351.9, 387.0, 423.2, 456.8, 487.9, 514.3, 559.9))
Expected behavior:
Either 250^9 (the superscript kept but marked as such) or 250 (the superscript simply ignored).
Actual behavior:
2509
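Once the cell has been flattened to 2509, the superscript can no longer be recovered from the string alone, so any fix has to happen at the glyph level, where font size is still known. The following is only a sketch of that idea, not something tabula-py offers: it assumes each glyph is available as a dict with "text" and "size" keys (some other PDF libraries expose per-character data in roughly this shape), and the function name join_glyphs, the tolerance value, and the sample font sizes are all hypothetical.

```python
from collections import Counter


def join_glyphs(chars, tolerance=0.5, marker=None):
    """Join glyph texts, treating glyphs smaller than the dominant size as superscripts.

    chars: list of dicts with "text" and "size" keys (an assumed layout).
    marker: if None, superscript glyphs are dropped; otherwise the marker is
            inserted before each run of superscript glyphs, which are kept.
    """
    if not chars:
        return ""
    # The dominant (most common) rounded font size is taken as the body size.
    sizes = Counter(round(c["size"], 1) for c in chars)
    body_size = sizes.most_common(1)[0][0]

    out = []
    in_super = False
    for c in chars:
        is_small = c["size"] < body_size - tolerance
        if is_small:
            if marker is not None:
                if not in_super:
                    out.append(marker)  # start of a superscript run
                out.append(c["text"])
        else:
            out.append(c["text"])
        in_super = is_small
    return "".join(out)


# Hypothetical glyphs for the cell above: "250" at 8 pt, superscript "9" at 5 pt.
cell = [
    {"text": "2", "size": 8.0},
    {"text": "5", "size": 8.0},
    {"text": "0", "size": 8.0},
    {"text": "9", "size": 5.0},
]

print(join_glyphs(cell))              # 250
print(join_glyphs(cell, marker="^"))  # 250^9
```

Both outcomes listed under expected behavior are covered: the default drops the superscript, and passing marker="^" keeps it in a marked form. A size-based test is a heuristic and would misfire on cells that mix font sizes for other reasons.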
Related Issues:
@drewbeh this issue was automatically closed because it did not follow the issue template