Timeout reading long XLSX files with many columns
The HXL Proxy is timing out trying to read the Excel file at
When opened in LibreOffice, there's an error message about too many columns.
Reported by @danmihaila
There is no error message, but it takes 51 seconds for the xlrd library to parse the file. Combine that with 10-12 seconds download time, and a bit more time for generating the result, and we're probably timing out in the HXL Proxy over a web connection.
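To confirm where the time goes, each stage can be timed separately. The following is only a rough sketch: `timed` is a hypothetical helper, not part of the HXL Proxy, and the figures are the ones quoted above.

```python
import time


def timed(step, *args, **kwargs):
    """Run step(*args, **kwargs) and return (result, seconds elapsed).

    Hypothetical helper for illustration; not part of the HXL Proxy.
    """
    start = time.perf_counter()
    result = step(*args, **kwargs)
    return result, time.perf_counter() - start


# Figures reported in this issue, in seconds
parse_time = 51
download_time = (10, 12)

# Time spent before any output is generated
low, high = (parse_time + d for d in download_time)
print("Before generating any output: {}-{} seconds".format(low, high))
```

Wrapping the download and the parse in separate `timed(...)` calls would show which of the two dominates for a given file.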
I can do data-preview on my local laptop, probably because I have a longer web-connection timeout: https://dev.proxy.hxlstandard.org/api/data-preview.csv?url=https%3A%2F%2Ffeature-data.humdata.org%2Fdataset%2F9d601e4f-e233-4c67-b610-b03d234cc8a4%2Fresource%2Fe4c6aac3-f67b-4a39-bdd6-3bec0d5c40e7%2Fdownload%2Freach_som_jmcna_final_dataset_august_2018.xlsx
Changing the title to indicate that the problem is the long time to parse the XLSX.
Proposed solution:
- Switch to using the sxl library for .xlsx files. It's much faster, and doesn't require loading the whole file into memory.
- Continue using the xlrd library for legacy .xls files, since sxl doesn't support them.
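A minimal sketch of what that split could look like, choosing the library by file extension. `pick_reader` and `iter_rows` are hypothetical names, not existing HXL code. The xlrd calls are its standard API; the sxl calls (`Workbook`, `sheets`, `rows`) are an assumption based on that library's README.

```python
import os


def pick_reader(filename):
    """Return 'sxl' for .xlsx files and 'xlrd' for legacy .xls files."""
    ext = os.path.splitext(filename)[1].lower()
    if ext == ".xlsx":
        return "sxl"
    if ext == ".xls":
        return "xlrd"
    raise ValueError("Not an Excel file: {}".format(filename))


def iter_rows(filename):
    """Yield the rows of the first sheet as lists of cell values."""
    if pick_reader(filename) == "sxl":
        # Imported lazily so xlrd-only installs still work.
        # Assumed sxl API: Workbook(path), .sheets[...], .rows
        import sxl
        workbook = sxl.Workbook(filename)
        # Assumption: sxl numbers sheets from 1
        for row in workbook.sheets[1].rows:
            yield row
    else:
        import xlrd
        sheet = xlrd.open_workbook(filename).sheet_by_index(0)
        for i in range(sheet.nrows):
            yield sheet.row_values(i)
```

Keeping the extension check in its own function means the dispatch can be tested without having either library installed.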
@danmihaila - I've switched to the recommended sxl library, but while it gives a result in about 10 seconds on my laptop, it's still not fast enough to avoid a timeout on the beta VM:
I've also found bugs in the sxl library, where the coder made assumptions about XLSX files that sometimes don't hold up. I think we might not be able to handle this file on demand in a web environment -- let's talk.
Not doing (for now).
Duplicate of #277