HXLStandard/libhxl-python

Timeout reading long XLSX files with many columns

Issue closed · 8 comments

The HXL Proxy is timing out trying to read the Excel file at

https://feature-data.humdata.org/dataset/9d601e4f-e233-4c67-b610-b03d234cc8a4/resource/e4c6aac3-f67b-4a39-bdd6-3bec0d5c40e7/download/reach_som_jmcna_final_dataset_august_2018.xlsx

Opening the file in LibreOffice produces an error message about too many columns.

Reported by @danmihaila

There is no error message, but the xlrd library takes 51 seconds to parse the file. Add 10-12 seconds of download time and a bit more for generating the result, and we're probably timing out in the HXL Proxy over a web connection.
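For anyone who wants to reproduce the numbers, here is a rough sketch of how the stages could be timed and added up against a timeout. The helper names `time_stage` and `fits_budget` are made up for illustration, and the 60-second timeout is a placeholder, not the proxy's actual setting.

```python
import time


def time_stage(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds). Hypothetical helper."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def fits_budget(stage_seconds, timeout_seconds):
    """Return True if the summed stage times stay under the timeout."""
    return sum(stage_seconds) < timeout_seconds


if __name__ == "__main__":
    # Figures quoted above: up to 12 s download, 51 s parse.
    # The 60 s timeout is an assumed placeholder value.
    print(fits_budget([12, 51], timeout_seconds=60))
```

Wrapping the download and parse calls in `time_stage` separately would show which stage dominates on a given machine.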

Changing the title to indicate that the problem is the long time to parse the XLSX.

Proposed solution:

  1. Switch to using the sxl library for .xlsx files: it's much faster and doesn't require loading the whole file into memory.

  2. Continue using the xlrd library for legacy .xls files, since sxl doesn't support them.
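A minimal sketch of what that split could look like, choosing the backend from the file extension. This is not the libhxl implementation; `pick_excel_backend` and `iter_rows` are hypothetical names, and the sketch assumes the data is on the first sheet. The sxl calls follow that library's README (`Workbook`, `sheets`, `rows`), so treat them as an assumption to check against the installed version.

```python
import os


def pick_excel_backend(url_or_path):
    """Return 'sxl' for .xlsx and 'xlrd' for .xls. Hypothetical helper."""
    # Drop any query string or fragment before checking the extension
    path = url_or_path.split("?", 1)[0].split("#", 1)[0]
    ext = os.path.splitext(path)[1].lower()
    if ext == ".xlsx":
        return "sxl"
    if ext == ".xls":
        return "xlrd"
    raise ValueError("not an Excel file: %r" % url_or_path)


def iter_rows(path):
    """Yield rows from the first sheet of a local Excel file."""
    if pick_excel_backend(path) == "sxl":
        # Imported lazily so the module loads without sxl installed
        from sxl import Workbook
        # Assumption from the sxl README: sheets are numbered from 1
        sheet = Workbook(path).sheets[1]
        for row in sheet.rows:  # rows are streamed, not preloaded
            yield row
    else:
        import xlrd
        sheet = xlrd.open_workbook(path).sheet_by_index(0)
        for i in range(sheet.nrows):
            yield sheet.row_values(i)
```

Because `iter_rows` is a generator, a caller that only needs a preview can stop after the first few rows without reading the rest of an .xlsx sheet.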

@danmihaila - I've switched to the recommended sxl library, but while it gives a result in about 10 seconds on my laptop, it's still not fast enough to avoid a timeout on the beta VM:

https://beta.proxy.hxlstandard.org/api/data-preview.csv?url=https%3A%2F%2Ffeature-data.humdata.org%2Fdataset%2F9d601e4f-e233-4c67-b610-b03d234cc8a4%2Fresource%2Fe4c6aac3-f67b-4a39-bdd6-3bec0d5c40e7%2Fdownload%2Freach_som_jmcna_final_dataset_august_2018.xlsx

I've also found bugs in the sxl library, where the coder made assumptions about XLSX files that sometimes don't hold up. I think we might not be able to handle this file on demand in a web environment -- let's talk.

Not doing (for now).

Duplicate of #277