chezou/tabula-py

Problem with reading two files, one it reads, the other (same format) doesn't

oscarcarrillou opened this issue · 3 comments

Summary of your issue

I have been using tabula to read some PDFs and it has been working great, until my client provided a new file in the same format that raises an error when read. For the life of me, I can't figure out how to fix it.

Check list before submit

  • Did you read FAQ?

  • (Optional, but really helpful) Your PDF URL: ?

File that can be read

https://drive.google.com/file/d/1CHhyGQ5Ftykq-YIqUWLETJ3JJa5SEEwP/view?usp=sharing

File that can't
https://drive.google.com/file/d/1CPBDHdO_OEg5_BP5cabwWNaGsFRMRMXQ/view?usp=sharing

  • Paste the output of import tabula; tabula.environment_info() on Python REPL: ?

Python version:
3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]
Java version:
java version "1.8.0_341"
Java(TM) SE Runtime Environment (build 1.8.0_341-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.341-b10, mixed mode)
tabula-py version: 2.4.0
platform: Windows-10-10.0.22000-SP0
uname:
uname_result(system='Windows', node='DESKTOP-PNFV4HP', release='10', version='10.0.22000', machine='AMD64')
linux_distribution: ('', '', '')
mac_ver: ('', ('', '', ''), '')

If not possible to execute tabula.environment_info(), please answer following questions manually.

  • Paste the output of python --version command on your terminal: ?
  • Paste the output of java -version command on your terminal: ?
  • Does java -h command work well?; Ensure your java command is included in PATH
  • Write your OS and its version: ?

What did you do when you faced the problem?

Sorry for the mess, I'm new at this.

Code:

import tabula
import pandas as pd
import string
import numpy as np
import re
import pdfplumber
import os
import gc
from pathlib import Path
from string import Formatter
from string import Template

datos = pd.DataFrame(columns=['factura','pin','nombre','cedula','direccion','ciudad', 'telefono','referencia','descripcion','cantidad','precio_uni','precio_tot','subtotal', 'IVA', 'Bono', 'TOTAL'])
for p in Path('data').glob('*.pdf'):
    file = p.name
    path = 'data/' + file
    try:
        dfs = tabula.read_pdf(path, pages=1, output_format="dataframe", pandas_options={'header': None})
        print(file, "Procesado")
        band = 1
    except Exception:
        print("No se pudo extraer el archivo ", file)
        band = 0
    if band == 0:
        # Skip files that tabula could not parse; the code below needs dfs.
        continue
    pdf = pdfplumber.open(path)
    p0 = pdf.pages[0]
    text = p0.extract_text()
    factura = text[20:24]
    pin = text[74:81]
    #dfs = tabula.read_pdf(path, pages=1,output_format="dataframe",pandas_options={'header': None})

    data = pd.DataFrame(dfs[0])
    column = len(data.columns)

    if type(dfs[0][1][0])!= float:
        nombre=dfs[0][1][0]
    else:
        nombre=dfs[0][0][0]

    cedula = dfs[0][3][0]
    if dfs[0][0][1] == "Dirección:":
        direccion = dfs[0][1][1]

    else:
        direccion = dfs[0][0][1]
    if dfs[0][4][0] == "Teléfono:":
        telefono = ""
    else:
        telefono = dfs[0][4][0]

    ciudad = dfs[0][3][1]

    if len(cedula)>10:
        texto = cedula.split(" ")
        cedula = texto[0]
        telefono = texto[2]


    if type(dfs[0][4][2]) == str :
        z = 4
    else:
        z = 3

    for j in range(len(dfs[0][4])):
        if dfs[0][z][j] == 'Subtotal':
            subtotal = dfs[0][column-1][j] 
            IVA = dfs[0][column-1][j+1]
            if dfs[0][z][j+2] == 'Bono prom':
                bono = dfs[0][column-1][j+2]
                total = dfs[0][column-1][j+3]
            else:
                bono = ""
                total = dfs[0][column-1][j+2]

    x = 3
    while type(dfs[0][0][x])!= float:
        referencia=dfs[0][0][x]
        descripcion=dfs[0][1][x]
        cantidad=dfs[0][3][x]
        if type(dfs[0][4][2])!= float:
            precio_uni=dfs[0][4][x]
        else:
            precio_uni=dfs[0][3][x]
        if len(cantidad)>2:
            cant = cantidad.split("19%")
            cantidad = cant[0]
            precio_uni = cant[1]
        else:
            cantidad=dfs[0][3][x]
        precio_tot=dfs[0][column-1][x]

        lista = [factura,pin,nombre, cedula, direccion, ciudad, telefono,referencia,descripcion,cantidad,precio_uni,precio_tot,subtotal, IVA, bono, total]
        datos.loc[len(datos)] = lista
        x+= 1

datos['cedula'] = datos['cedula'].str.replace('Teléfono:','')
new = datos["cedula"].str.split(" ", n = 1, expand = True)
datos["cedula"]= new[0]
datos['direccion'] = datos['direccion'].str.replace('Dirección:','')
datos['nombre'] = datos['nombre'].str.replace('NIT/Cédula:','')
datos['nombre'] = datos['nombre'].str.replace('Beneficiario:','')
datos['direccion'] = datos['direccion'].str.replace('Ciudad:','')
datos['telefono'] = datos['telefono'].str.replace('Teléfono:','')
datos['precio_uni'] = datos['precio_uni'].str.replace('19%','')
datos['ciudad'] = datos['ciudad'].str.replace('País:','')

file_name = 'Clientes.xlsx'
datos.to_excel(file_name)
print('Archivo Clientes.xlsx generado.')


Expected behavior:

The file is read without errors.

Actual behavior:

Traceback (most recent call last)
~\anaconda3\lib\site-packages\tabula\io.py in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, user_agent, **kwargs)
345 try:
--> 346 return [pd.read_csv(io.BytesIO(output), **pandas_options)]
347 except pd.errors.ParserError as e:

~\anaconda3\lib\site-packages\pandas\io\parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
609
--> 610 return _read(filepath_or_buffer, kwds)
611

~\anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
467 with parser:
--> 468 return parser.read(nrows)
469

~\anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
1056 nrows = validate_integer("nrows", nrows)
-> 1057 index, columns, col_dict = self._engine.read(nrows)
1058

~\anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
2035 try:
-> 2036 data = self._reader.read(nrows)
2037 except StopIteration:

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3

During handling of the above exception, another exception occurred:

CSVParseError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_6740/1469781751.py in <module>
2 file = p.name
3 path = 'data/' + file
----> 4 dfs = tabula.read_pdf(path, pages=1,output_format="dataframe",pandas_options={'header': None})
5 print(file, "Procesado")
6 band = 1

~\anaconda3\lib\site-packages\tabula\io.py in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, user_agent, **kwargs)
352 )
353
--> 354 raise CSVParseError(message, e)
355
356

CSVParseError: Error failed to create DataFrame with different column tables.
Try to set multiple_tables=Trueor set names option for pandas_options.
, caused by ParserError('Error tokenizing data. C error: Expected 2 fields in line 3, saw 3\n')


6271.pdf
6272.pdf

Related Issues:

output_format doesn't work together with multiple_tables, which is set to True by default so that multiple tables can be parsed. Please remove the output_format option.

>>> fname = "~/Downloads/6272.pdf"
>>> import tabula
>>> tabula.read_pdf(fname, pages=1, pandas_options={'header': None})
[    0   1
0 NaN NaN,     0
0 NaN,     0   1   2
0 NaN NaN NaN,     0
0 NaN,     0
0 NaN,     0   1
0 NaN NaN,                                                    0                                    1                        2         3               4               5
0  Beneficiario:Lorena AmayaNIT/Cédula:1098640829...                                  NaN                      NaN       NaN             NaN             NaN
1                                         Referencia                          Descripción                 Cantidad       IVA  Valor Unitario     Valor Total
2                                  Porcelain\rARCANA  Porcelain, porcelain/-\rARCANA, -/-                     1\r1  19%\r19%  55,462\r30,252  55,462\r30,252
3                                                NaN                 Subtotal\rIVA\rTOTAL  85,714\r16,286\r102,000       NaN             NaN             NaN]
>>> tabula.read_pdf(fname, pages=1, pandas_options={'header': None}, output_format="dataframe")
Traceback (most recent call last):
  File "/Users/ariga/src/tabula-py/tabula/io.py", line 353, in read_pdf
    return [pd.read_csv(io.BytesIO(output), **pandas_options)]
  File "/Users/ariga/src/tabula-py/.venv/lib/python3.10/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/Users/ariga/src/tabula-py/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/Users/ariga/src/tabula-py/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 581, in _read
    return parser.read(nrows)
  File "/Users/ariga/src/tabula-py/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1254, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/Users/ariga/src/tabula-py/.venv/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 225, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ariga/src/tabula-py/tabula/io.py", line 361, in read_pdf
    raise CSVParseError(message, e)
tabula.errors.CSVParseError: Error failed to create DataFrame with different column tables.
Try to set `multiple_tables=True`or set `names` option for `pandas_options`.
, caused by ParserError('Error tokenizing data. C error: Expected 2 fields in line 3, saw 3\n')
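
Since the call without output_format returns a list of DataFrames, one possible way to get at the invoice table is to keep the largest one. This is only a sketch: `pick_largest_table` and `read_main_table` are hypothetical helper names, and choosing by cell count is an assumption that happens to fit the output above, where the last table is the only one with real data.

```python
import pandas as pd


def pick_largest_table(tables):
    """Return the DataFrame with the most cells from a list of DataFrames."""
    if not tables:
        raise ValueError("no tables were extracted")
    return max(tables, key=lambda df: df.shape[0] * df.shape[1])


def read_main_table(path):
    """Read page 1 without output_format and keep the largest table."""
    import tabula  # imported lazily; needs tabula-py and a working Java

    # Same call as in the REPL session above: no output_format,
    # so multiple_tables keeps its default of True and a list is returned.
    tables = tabula.read_pdf(path, pages=1, pandas_options={"header": None})
    return pick_largest_table(tables)
```

With this, `read_main_table("data/6272.pdf")` could replace the original `tabula.read_pdf(...)` call, although the column positions used later in the script may need adjusting because the table layout differs from what output_format="dataframe" produced for 6271.pdf.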

I don't think you understand the bug. It works fine with 6271.pdf but raises an error with 6272.pdf, even though both files are generated by the same software and look the same.

Did you try dropping the unnecessary "output_format" option?

I was trying to create a minimal reproducible example, which the issue reporter is expected to provide, and I found that setting the output_format option forces multiple_tables=False, even though its default is True.

If multiple_tables=False and tabula-java extracts multiple tables, tabula-py fails to build the DataFrame, regardless of how similar the PDFs look.
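
The failure mode described above can be reproduced without a PDF. A minimal sketch, assuming the CSV that comes back for several tables is simply a run of rows with different widths (the `ragged_csv` bytes below are made up for illustration and are not the real output for 6272.pdf):

```python
import io

import pandas as pd

# Hypothetical CSV: two rows with 2 fields, then one row with 3 fields,
# as if two tables of different widths had been written one after another.
ragged_csv = b"a,b\nc,d\ne,f,g\n"

try:
    pd.read_csv(io.BytesIO(ragged_csv), header=None)
    error_message = None
except pd.errors.ParserError as exc:
    # pandas infers the column count from the first row and rejects wider rows
    error_message = str(exc)

print(error_message)
```

The message printed here has the same shape as the ParserError in the traceback, which is what tabula-py then wraps in CSVParseError.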

If you want me to keep providing free consultation for your client without your cooperation, there is nothing more I can do.