Timeout reading long XLSX files with many columns

Question


Closed this issue 3 years ago · 8 comments

The HXL Proxy is timing out trying to read the Excel file at

https://feature-data.humdata.org/dataset/9d601e4f-e233-4c67-b610-b03d234cc8a4/resource/e4c6aac3-f67b-4a39-bdd6-3bec0d5c40e7/download/reach_som_jmcna_final_dataset_august_2018.xlsx

When opened in LibreOffice, there's an error message about too many columns.

Reported by @danmihaila

Answer 1 · 2020-03-18T17:27:20.000Z

There is no error message, but it takes 51 seconds for the xlrd library to parse the file. Combine that with 10-12 seconds download time, and a bit more time for generating the result, and we're probably timing out in the HXL Proxy over a web connection.
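A parse time like the one above can be measured with a small timing wrapper. This is only a minimal sketch: the helper name `timed` and the file name in the usage comment are hypothetical, and it assumes an xlrd release older than 2.0, since later releases read only legacy .xls files.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Usage sketch (assumes xlrd < 2.0 is installed and "dataset.xlsx"
# is a local copy of the file; both are assumptions, not from the thread):
#   import xlrd
#   book, seconds = timed(xlrd.open_workbook, "dataset.xlsx")
#   print(f"parsed in {seconds:.1f} s")
```

Timing the download and the parse separately this way shows which step uses most of the time before the web connection gives up.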

Answer 2 · 2020-03-18T19:03:44.000Z

I can do data-preview on my local laptop, probably because I have a longer web-connection timeout: https://dev.proxy.hxlstandard.org/api/data-preview.csv?url=https%3A%2F%2Ffeature-data.humdata.org%2Fdataset%2F9d601e4f-e233-4c67-b610-b03d234cc8a4%2Fresource%2Fe4c6aac3-f67b-4a39-bdd6-3bec0d5c40e7%2Fdownload%2Freach_som_jmcna_final_dataset_august_2018.xlsx

Answer 3 · 2020-03-18T19:04:24.000Z

Changing the title to indicate that the problem is the long time to parse the XLSX.

Answer 4 · 2020-03-18T19:06:39.000Z

See https://stackoverflow.com/questions/31181042/xlrd-very-slow-opening-excel-file

Answer 5 · 2020-03-18T21:14:44.000Z

Proposed solution:

Switch to using the sxl library for .xlsx files — it's much faster, and doesn't require loading the whole file into memory.
Continue using the xlrd library for legacy .xls files, since sxl doesn't support them.
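
The split proposed above could be sketched as a dispatch on the file extension in the URL. This is an illustration only, not the HXL Proxy's actual code: the function name `choose_excel_backend` is hypothetical, and it returns the name of the library to use rather than calling either one.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def choose_excel_backend(url: str) -> str:
    """Return the parser library name to use for an Excel URL.

    Hypothetical helper: decides from the extension of the URL path,
    ignoring any query string or fragment.
    """
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix == ".xlsx":
        return "sxl"   # streaming reader for the newer XML-based format
    if suffix == ".xls":
        return "xlrd"  # legacy binary format, which sxl does not read
    raise ValueError(f"Not an Excel file URL: {url}")
```

An extension check is cheap but fails for URLs with no extension or a misleading one, so a real implementation may also need to inspect the downloaded content.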

Answer 6 · 2020-03-19T22:48:16.000Z

@danmihaila - I've switched to the recommended sxl library, but while it gives a result in about 10 seconds on my laptop, it's still not fast enough to avoid a timeout on the beta VM:

https://beta.proxy.hxlstandard.org/api/data-preview.csv?url=https%3A%2F%2Ffeature-data.humdata.org%2Fdataset%2F9d601e4f-e233-4c67-b610-b03d234cc8a4%2Fresource%2Fe4c6aac3-f67b-4a39-bdd6-3bec0d5c40e7%2Fdownload%2Freach_som_jmcna_final_dataset_august_2018.xlsx

I've also found bugs in the sxl library, where the coder made assumptions about XLSX files that sometimes don't hold up. I think we might not be able to handle this file on demand in a web environment -- let's talk.

Answer 7 · 2020-04-02T18:14:38.000Z

Not doing (for now).

Answer 8 · 2022-04-21T13:54:16.000Z

Duplicate of #277