chezou/tabula-py

Superscript numbers are coerced into normal numbers

drewbeh opened this issue · 1 comments

I am attempting to extract some data that contains superscripts. Image of the number in question: https://i.stack.imgur.com/tdXKR.png

PDF in question (the table is on page 147): https://edisciplinas.usp.br/pluginfile.php/4557662/mod_resource/content/1/CRC%20Handbook%20of%20Chemistry%20and%20Physics%2095th%20Edition.pdf

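import tabula

# pdf_path is the local path to the PDF linked above
tabula.read_pdf(
    pdf_path, pages=147, multiple_tables=False, stream=True, guess=False,
    area=(54.2, 53.8, 794.3, 615.0),  # top, left, bottom, right
    columns=(70.1, 152.9, 236.8, 287.7, 324.9, 351.9, 387.0, 423.2, 456.8, 487.9, 514.3, 559.9),  # x-coordinates of column boundaries
)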

In row 4 (index 3), the bp value is read as 2509, when it is actually 250 with a superscript 9 that points to the 9th reference.
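One possible workaround, sketched here as an assumption rather than anything tabula-py provides: as far as I can tell, the DataFrame returned by read_pdf carries no font information, so the superscript cannot be recovered from it afterwards. A library that exposes per-character font sizes (for example pdfplumber, whose page.chars entries include "text" and "size" keys) would let you separate the small glyphs from the base number. The helper name split_superscripts, the 0.8 ratio, and the font sizes in the sample data below are all hypothetical and only illustrate the idea.

```python
from statistics import median


def split_superscripts(chars, ratio=0.8):
    """Split one cell's characters into (base_text, superscript_text).

    Each item in `chars` is a dict with "text" and "size" keys, the shape
    used by pdfplumber's `page.chars`. A character is treated as a
    superscript when its font size is below `ratio` times the median size
    in the cell. This threshold is an assumed heuristic.
    """
    sizes = [c["size"] for c in chars]
    if not sizes:
        return "", ""
    threshold = ratio * median(sizes)
    base = "".join(c["text"] for c in chars if c["size"] >= threshold)
    sup = "".join(c["text"] for c in chars if c["size"] < threshold)
    return base, sup


# Synthetic stand-in for the bp cell from this issue. The sizes are made
# up for illustration and were not measured from the PDF.
cell = [
    {"text": "2", "size": 8.0},
    {"text": "5", "size": 8.0},
    {"text": "0", "size": 8.0},
    {"text": "9", "size": 5.0},
]

print(split_superscripts(cell))  # ('250', '9')
```

Grouping the characters into cells would still need the same area and columns coordinates used in the read_pdf call above. The median is used so that a single small glyph does not drag the threshold down.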

Kaggle kernel
Python version: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53)
Java version: openjdk version "11.0.16" 2022-07-19
OpenJDK Runtime Environment (build 11.0.16+8-post-Ubuntu-0ubuntu120.04)
OpenJDK 64-Bit Server VM (build 11.0.16+8-post-Ubuntu-0ubuntu120.04, mixed mode, sharing)
tabula-py version: 2.5.1

@drewbeh this issue was automatically closed because it did not follow the issue template