Problem with reading two files, one it reads, the other (same format) doesn't
oscarcarrillou opened this issue · 3 comments
Summary of your issue
I have been using tabula to read some PDFs and it has been working great, until my client provided a new file in the same format that raises an error when read. For the life of me, I can't figure out how to fix it.
Check list before submit
- Did you read FAQ?
- (Optional, but really helpful) Your PDF URL: ?
File that can be read
https://drive.google.com/file/d/1CHhyGQ5Ftykq-YIqUWLETJ3JJa5SEEwP/view?usp=sharing
File that can't be read
https://drive.google.com/file/d/1CPBDHdO_OEg5_BP5cabwWNaGsFRMRMXQ/view?usp=sharing
- Paste the output of `import tabula; tabula.environment_info()` on Python REPL:
Python version:
3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]
Java version:
java version "1.8.0_341"
Java(TM) SE Runtime Environment (build 1.8.0_341-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.341-b10, mixed mode)
tabula-py version: 2.4.0
platform: Windows-10-10.0.22000-SP0
uname:
uname_result(system='Windows', node='DESKTOP-PNFV4HP', release='10', version='10.0.22000', machine='AMD64')
linux_distribution: ('', '', '')
mac_ver: ('', ('', '', ''), '')
What did you do when you faced the problem?
Sorry for the mess, I'm new at this.
Code:
import tabula
import pandas as pd
import string
import numpy as np
import re
import pdfplumber
import os
import gc
from pathlib import Path
from string import Formatter
from string import Template

# One output row per invoice line item.
datos = pd.DataFrame(columns=['factura','pin','nombre','cedula','direccion','ciudad', 'telefono','referencia','descripcion','cantidad','precio_uni','precio_tot','subtotal', 'IVA', 'Bono', 'TOTAL'])

for p in Path('data').glob('*.pdf'):
    file = p.name
    path = 'data/' + file
    try:
        dfs = tabula.read_pdf(path, pages=1,output_format="dataframe",pandas_options={'header': None})
        print(file, "Procesado")
        band = 1
    except Exception:
        # The file could not be extracted; skip it.
        print("No se pudo extraer el archivo ", file)
        band = 0
    if band == 1:
        # Invoice number and PIN come from fixed offsets in the page text.
        pdf = pdfplumber.open(path)
        p0 = pdf.pages[0]
        text = p0.extract_text()
        factura = text[20:24]
        pin = text[74:81]
        #dfs = tabula.read_pdf(path, pages=1,output_format="dataframe",pandas_options={'header': None})
        data = pd.DataFrame(dfs[0])
        column = len(data.columns)

        # Customer header fields; their column position varies between files.
        if type(dfs[0][1][0])!= float:
            nombre=dfs[0][1][0]
        else:
            nombre=dfs[0][0][0]
        cedula = dfs[0][3][0]
        if dfs[0][0][1] == "Dirección:":
            direccion = dfs[0][1][1]
        else:
            direccion = dfs[0][0][1]
        if dfs[0][4][0] == "Teléfono:":
            telefono = ""
        else:
            telefono = dfs[0][4][0]
        ciudad = dfs[0][3][1]
        if len(cedula)>10:
            # ID and phone number ended up in the same cell.
            texto = cedula.split(" ")
            cedula = texto[0]
            telefono = texto[2]

        # Column holding the 'Subtotal' label.
        if type(dfs[0][4][2]) == str :
            z = 4
        else:
            z = 3

        # Totals block: subtotal, IVA, optional bonus, total.
        for j in range(len(dfs[0][4])):
            if dfs[0][z][j] == 'Subtotal':
                subtotal = dfs[0][column-1][j]
                IVA = dfs[0][column-1][j+1]
                if dfs[0][z][j+2] == 'Bono prom':
                    bono = dfs[0][column-1][j+2]
                    total = dfs[0][column-1][j+3]
                else:
                    bono = ""
                    total = dfs[0][column-1][j+2]

        # Line items start at row 3 and end at the first NaN reference.
        x = 3
        while type(dfs[0][0][x])!= float:
            referencia=dfs[0][0][x]
            descripcion=dfs[0][1][x]
            cantidad=dfs[0][3][x]
            if type(dfs[0][4][2])!= float:
                precio_uni=dfs[0][4][x]
            else:
                precio_uni=dfs[0][3][x]
            if len(cantidad)>2:
                # Quantity and unit price were merged around the "19%" IVA text.
                cant = cantidad.split("19%")
                cantidad = cant[0]
                precio_uni = cant[1]
            else:
                cantidad=dfs[0][3][x]
            precio_tot=dfs[0][column-1][x]
            lista = [factura,pin,nombre, cedula, direccion, ciudad, telefono,referencia,descripcion,cantidad,precio_uni,precio_tot,subtotal, IVA, bono, total]
            datos.loc[len(datos)] = lista
            x+= 1

# Strip the field labels that leak into the extracted values.
datos['cedula'] = datos['cedula'].str.replace('Teléfono:','')
new = datos["cedula"].str.split(" ", n = 1, expand = True)
datos["cedula"]= new[0]
datos['direccion'] = datos['direccion'].str.replace('Dirección:','')
datos['nombre'] = datos['nombre'].str.replace('NIT/Cédula:','')
datos['nombre'] = datos['nombre'].str.replace('Beneficiario:','')
datos['direccion'] = datos['direccion'].str.replace('Ciudad:','')
datos['telefono'] = datos['telefono'].str.replace('Teléfono:','')
datos['precio_uni'] = datos['precio_uni'].str.replace('19%','')
datos['ciudad'] = datos['ciudad'].str.replace('País:','')

file_name = 'Clientes.xlsx'
datos.to_excel(file_name)
print('Archivo Clientes.xlsx generado.')
Expected behavior:
The file is read.
Actual behavior:
Traceback (most recent call last)
~\anaconda3\lib\site-packages\tabula\io.py in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, user_agent, **kwargs)
345 try:
--> 346 return [pd.read_csv(io.BytesIO(output), **pandas_options)]
347 except pd.errors.ParserError as e:
~\anaconda3\lib\site-packages\pandas\io\parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
609
--> 610 return _read(filepath_or_buffer, kwds)
611
~\anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
467 with parser:
--> 468 return parser.read(nrows)
469
~\anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
1056 nrows = validate_integer("nrows", nrows)
-> 1057 index, columns, col_dict = self._engine.read(nrows)
1058
~\anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
2035 try:
-> 2036 data = self._reader.read(nrows)
2037 except StopIteration:
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()
pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()
ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3
During handling of the above exception, another exception occurred:
CSVParseError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_6740/1469781751.py in <module>
2 file = p.name
3 path = 'data/' + file
----> 4 dfs = tabula.read_pdf(path, pages=1,output_format="dataframe",pandas_options={'header': None})
5 print(file, "Procesado")
6 band = 1
~\anaconda3\lib\site-packages\tabula\io.py in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, user_agent, **kwargs)
352 )
353
--> 354 raise CSVParseError(message, e)
355
356
CSVParseError: Error failed to create DataFrame with different column tables.
Try to set `multiple_tables=True`or set `names` option for `pandas_options`.
, caused by ParserError('Error tokenizing data. C error: Expected 2 fields in line 3, saw 3\n')
Related Issues:
`output_format` doesn't work with `multiple_tables`, which is set to `True` by default for multiple-table parsing. Please remove the option.
>>> fname = "~/Downloads/6272.pdf"
>>> import tabula
>>> tabula.read_pdf(fname, pages=1, pandas_options={'header': None})
[ 0 1
0 NaN NaN, 0
0 NaN, 0 1 2
0 NaN NaN NaN, 0
0 NaN, 0
0 NaN, 0 1
0 NaN NaN, 0 1 2 3 4 5
0 Beneficiario:Lorena AmayaNIT/Cédula:1098640829... NaN NaN NaN NaN NaN
1 Referencia Descripción Cantidad IVA Valor Unitario Valor Total
2 Porcelain\rARCANA Porcelain, porcelain/-\rARCANA, -/- 1\r1 19%\r19% 55,462\r30,252 55,462\r30,252
3 NaN Subtotal\rIVA\rTOTAL 85,714\r16,286\r102,000 NaN NaN NaN]
>>> tabula.read_pdf(fname, pages=1, pandas_options={'header': None}, output_format="dataframe")
Traceback (most recent call last):
File "/Users/ariga/src/tabula-py/tabula/io.py", line 353, in read_pdf
return [pd.read_csv(io.BytesIO(output), **pandas_options)]
File "/Users/ariga/src/tabula-py/.venv/lib/python3.10/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/Users/ariga/src/tabula-py/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "/Users/ariga/src/tabula-py/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 581, in _read
return parser.read(nrows)
File "/Users/ariga/src/tabula-py/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1254, in read
index, columns, col_dict = self._engine.read(nrows)
File "/Users/ariga/src/tabula-py/.venv/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 225, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas/_libs/parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/ariga/src/tabula-py/tabula/io.py", line 361, in read_pdf
raise CSVParseError(message, e)
tabula.errors.CSVParseError: Error failed to create DataFrame with different column tables.
Try to set `multiple_tables=True`or set `names` option for `pandas_options`.
, caused by ParserError('Error tokenizing data. C error: Expected 2 fields in line 3, saw 3\n')
I don't think you understand the bug. With the file 6271.pdf it works fine, but it raises an error with 6272.pdf, even though both files are generated by the same software and look the same.
Did you try dropping the meaningless `output_format` option?
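For reference, a minimal sketch of what reading without `output_format` could look like in the reporter's loop. The `read_pdf` call matches the one shown in the REPL session above; `read_tables` and `widest_table` are hypothetical helper names, and picking the table with the most columns is only an assumption based on the 6272.pdf output, where the invoice data sits in the 6-column table.

```python
def read_tables(pdf_path):
    """Return every table tabula finds on page 1 as a list of DataFrames."""
    # Imported lazily: this needs tabula-py and a working Java runtime.
    import tabula

    # No output_format here, so multiple_tables keeps its default of True
    # and each detected table comes back as its own DataFrame.
    return tabula.read_pdf(str(pdf_path), pages=1, pandas_options={"header": None})


def widest_table(tables):
    """Pick the table with the most columns (hypothetical heuristic).

    Works on any objects exposing a DataFrame-like .shape tuple.
    On ties, the first such table is returned.
    """
    if not tables:
        raise ValueError("no tables were extracted")
    return max(tables, key=lambda table: table.shape[1])


# Usage inside the existing loop (not executed here):
#     tables = read_tables(path)
#     invoice = widest_table(tables)
```

The rest of the script would then index into `invoice` instead of `dfs[0]`. The row and column positions may differ from the ones the script currently assumes, so they would need to be checked against both PDFs.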
I was trying to create a minimal reproducible example, which the issue reporter is required to provide, and I found that setting the `output_format` option forces `multiple_tables=False`, while its default is `True`.
If `multiple_tables=False` and tabula-java extracts multiple tables, tabula-py fails to extract, regardless of how similar the tables look.
If you want me to continue providing free consultation for your client without your cooperation, there is nothing more I can do.
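To illustrate the failure mode described above: per the traceback, with `multiple_tables=False` the whole extraction output goes through a single `pd.read_csv` call, so a later row with more fields than the first one cannot be parsed. The sketch below reproduces that condition with the standard library only. It is not pandas' implementation, `first_overlong_row` is a hypothetical helper, and the sample text is made up rather than taken from the PDFs.

```python
import csv
import io


def first_overlong_row(csv_text):
    """Return (row_number, expected, seen) for the first row that has more
    fields than the first row, or None if every row fits."""
    reader = csv.reader(io.StringIO(csv_text))
    expected = None
    for row_number, row in enumerate(reader, start=1):
        if expected is None:
            # The first row fixes the expected field count.
            expected = len(row)
        elif len(row) > expected:
            return row_number, expected, len(row)
    return None


# Two hypothetical tables glued together: two 2-column rows, then a 3-column row.
merged = "a,b\nc,d\ne,f,g\n"
print(first_overlong_row(merged))  # (3, 2, 3)
```

The result lines up with the message in the traceback, "Expected 2 fields in line 3, saw 3". Two PDFs can look identical and still differ in how many tables tabula-java detects, which would explain why 6271.pdf parses and 6272.pdf does not.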