frictionlessdata/tabulator-py

XLS files are not supported

cschloer opened this issue · 4 comments

Overview

I'm somewhat confused by this because I've seen XLS files work before. Maybe it's just an older version of the filetype that isn't supported.

I will share the file as soon as I get permission to release it publicly.

Running a pipeline that uses tabulator with an XLS file returns the error:

bcodmo_pipeline_processors.load

Traceback (most recent call last):

....

File "/home/conrad/.virtualenvs/laminar-server/lib/python3.7/site-packages/tabulator/stream.py", line 341, in open
self.__parser.open(source, encoding=self.__encoding)

File "/home/conrad/.virtualenvs/laminar-server/lib/python3.7/site-packages/tabulator/parsers/xlsx.py", line 70, in open
self.__bytes, read_only=not self.__fill_merged_cells, data_only=True)

File "/home/conrad/.virtualenvs/laminar-server/lib/python3.7/site-packages/openpyxl/reader/excel.py", line 313, in load_workbook
data_only, keep_links)

File "/home/conrad/.virtualenvs/laminar-server/lib/python3.7/site-packages/openpyxl/reader/excel.py", line 124, in __init__
self.archive = _validate_archive(fn)

File "/home/conrad/.virtualenvs/laminar-server/lib/python3.7/site-packages/openpyxl/reader/excel.py", line 96, in _validate_archive
archive = ZipFile(filename, 'r')

File "/usr/lib/python3.7/zipfile.py", line 1225, in __init__
self._RealGetContents()

File "/usr/lib/python3.7/zipfile.py", line 1292, in _RealGetContents
raise BadZipFile("File is not a zip file")

zipfile.BadZipFile: File is not a zip file
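The last frames show the xlsx parser handing the file to openpyxl, which opens it as a ZIP archive. As a rough sketch of why that fails for a legacy workbook, the snippet below feeds `zipfile.ZipFile` a buffer that starts with the OLE2 compound-file signature (the container used by old .xls files). The buffer is a stand-in made up for illustration, not the real file, and `open_as_zip` is a hypothetical helper name.

```python
import io
import zipfile

# First 8 bytes of an OLE2 compound file, the container used by legacy .xls.
# The zero padding is a stand-in payload, not a real workbook.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
fake_xls = io.BytesIO(OLE2_MAGIC + b"\x00" * 504)


def open_as_zip(buffer):
    """Return None if the buffer opens as a ZIP, else the error message."""
    try:
        with zipfile.ZipFile(buffer, "r"):
            return None
    except zipfile.BadZipFile as exc:
        return str(exc)


error = open_as_zip(fake_xls)
print(error)
```

The same exception type appears at the bottom of the traceback above, which suggests the file is simply not in the ZIP-based .xlsx format.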

When I loaded the file by path with openpyxl directly in the Python interpreter, I got a clearer error:

>>> from openpyxl import load_workbook
>>> wb = load_workbook(filename="/path/to/xls_problem.xls")
Traceback (most recent call last):
...
    raise InvalidFileException(msg)
openpyxl.utils.exceptions.InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format.

Can we update the XLSX parser to fall back to xlrd when it's an XLS file?
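To make the idea concrete, here is a minimal sketch of how such a fallback could decide which reader to use, by looking at the first bytes of the file rather than its name. This is not how tabulator works; `sniff_excel_format` is a hypothetical helper, and a ZIP signature only shows that the container is a ZIP archive, not that it is a valid .xlsx workbook.

```python
# Signatures of the two container formats (these byte values are standard).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .xls (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP archive, used by .xlsx


def sniff_excel_format(head):
    """Guess 'xls' or 'xlsx' from the leading bytes of a file.

    Hypothetical helper for illustration; returns None when neither
    signature matches.
    """
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    return None


print(sniff_excel_format(OLE2_MAGIC + b"\x00" * 8))
print(sniff_excel_format(ZIP_MAGIC + b"\x00" * 8))
print(sniff_excel_format(b"Year,Month,Site\n"))
```

A parser could read the first 8 bytes of the source, call a helper like this, and pick xlrd or openpyxl accordingly.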


Please preserve this line to notify @roll (lead of this repository)

xls_problem.xls.txt
Here's the file. Just make sure to remove the .txt extension.

roll commented

Hi @cschloer,

the file works fine with tabulator:

from tabulator import Stream

with Stream('tmp/issue301.xls', sheet='Fig. 1') as stream:
    print(stream.read())
# [['Year', 'Month', 'Site', 'Coral cover %', 'SE coral cover', 'Lobe density mean', 'SE lobe density', 'N quadrats', 'Mean lobe size (cm)', 'SE lobe size (cm)', 'N (lobes)'], [1988, 3, 'Tektite', 32.8, 2.8, 35.8, 6, 30, 67.8, 2.94, 805], ...]

I think the problem is that you need to either set the format to xls or not specify a format at all (it will then be detected automatically). It also looks like xls and xlsx are merged into one format here: https://github.com/BCODMO/laminar-web/issues/464

Please re-open if this doesn't help.

roll commented

So what I mean:

with Stream('tmp/issue301.xls') as stream: # good
with Stream('tmp/issue301.xls', format='xls') as stream: # good
with Stream('tmp/issue301.xls', format='xlsx') as stream: # error
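For a caller that currently treats xls and xlsx as one format, a small sketch of one way to avoid the mismatch is to derive the `format` value from the file extension and otherwise leave it unset so detection can run. `excel_format` and `EXCEL_FORMATS` are hypothetical names used for illustration; they are not part of tabulator.

```python
import os

# Map each Excel extension to the format name used in the examples above.
EXCEL_FORMATS = {".xls": "xls", ".xlsx": "xlsx"}


def excel_format(path):
    """Return 'xls' or 'xlsx' based on the extension, or None if unknown.

    Hypothetical helper: None is meant to signal "do not force a format".
    """
    extension = os.path.splitext(path)[1].lower()
    return EXCEL_FORMATS.get(extension)


print(excel_format("tmp/issue301.xls"))
print(excel_format("tmp/report.XLSX"))
print(excel_format("tmp/data.csv"))
```

The result could then be passed along as `Stream(path, format=excel_format(path))`, assuming that passing `format=None` behaves the same as omitting the argument.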

thanks, that worked!