woahdae/simple_xlsx_reader

Fatally high memory usage when opening large spreadsheets


Repro steps:

  1. Create a simple workbook with several worksheets and around a million total rows (a sketch of one way to generate such a file follows this list). The resulting .xlsx file is around 27 MiB.

  2. Open it with the following code:

require 'simple_xlsx_reader'
# documented usage takes the workbook path directly
doc = SimpleXlsxReader.open('large.xlsx')
  3. Observe that the Ruby process consumes all available RAM on the system (6.0 GiB), after which the system locks up and requires a hard reboot. Presumably it's paging like mad, since the drive light is constantly on.
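
For reference, a workbook of roughly this shape could be generated with something like the axlsx gem. This is only a sketch under assumed contents (five sheets, short and mostly unique cells), not the actual file from the report:

require 'axlsx'

package = Axlsx::Package.new
5.times do |s|
  package.workbook.add_worksheet(name: "Sheet#{s + 1}") do |sheet|
    200_000.times do |r|
      # short, mostly unique cells; real data may dedupe better via shared strings
      sheet.add_row ["row #{r}", r, r * 0.5]
    end
  end
end
package.serialize('large.xlsx') # ~1,000,000 rows across five sheets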

Details:

ruby 2.2.1p85 (2015-02-26 revision 49769) [x86_64-linux]
simple_xlsx_reader (1.0.2)
nokogiri (1.6.6.2)
rubyzip (1.1.7)

Seems a bit harsh on Ruby's part, but I'm not surprised in general. This project loads all of the Excel data into memory as Ruby strings, which are relatively expensive objects, and Ruby itself has trouble allocating and managing huge numbers of objects. If the sheet uses the "shared strings" feature (an optional part of the xlsx spec that this project supports), that might help, but honestly it's probably already in use if you're generating the file in Excel.
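
To put very rough numbers on it (the cell count and per-cell size below are assumptions for illustration, not measurements from the reported file):

rows           = 1_000_000
cells_per_row  = 10   # assumed
bytes_per_cell = 60   # assumed: ~40-byte MRI object slot plus some character data
total_mib = rows * cells_per_row * bytes_per_cell / 1024 / 1024
puts "~#{total_mib} MiB just for the cell strings"
# => ~572 MiB, before Nokogiri's parsed XML, the per-row arrays, and GC headroom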

I've thought about a feature for streaming the sheet, but it's a tough problem, precisely because of that "shared strings" feature. It's basically a table mapping indexes to strings that Excel uses to avoid storing the same string over and over, but it means that to read an xlsx file you have to load the whole shared-strings table into memory before you can process any rows. Good for small or medium files, bad for extremely large files you want to read programmatically as a stream :(
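
To make the constraint concrete, here's a rough sketch of the step a streaming reader would still be forced into; the SAX handler below is simplified (it ignores rich-text runs and other edge cases) and isn't part of this gem's API:

require 'zip'
require 'nokogiri'

# Collects the text of each <si> entry in xl/sharedStrings.xml
class SharedStringsHandler < Nokogiri::XML::SAX::Document
  attr_reader :strings

  def initialize
    @strings = []
    @in_text = false
  end

  def start_element(name, _attrs = [])
    @strings << '' if name == 'si'
    @in_text = true if name == 't'
  end

  def characters(text)
    @strings.last << text if @in_text
  end

  def end_element(name)
    @in_text = false if name == 't'
  end
end

Zip::File.open('large.xlsx') do |zip|
  handler = SharedStringsHandler.new
  Nokogiri::XML::SAX::Parser.new(handler).parse(zip.read('xl/sharedStrings.xml'))
  # The entire table has to be resident before a single row of any sheet can be
  # resolved, which is exactly what defeats constant-memory streaming.
end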

Ideas or pull requests welcome. Are you aware of any Excel readers that can do this in Ruby?

Also, could you upload the Excel file you're generating? Step 1, "generate an Excel file with a million rows," isn't as specific as you might think. For example: is it using shared strings? What percentage of the fields are repeated? How large is each cell? And so on. (No need to answer these here, just upload the file somewhere ;)
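
For what it's worth, the shared-strings question is easy to check with the rubyzip dependency this gem already uses ('large.xlsx' below is just a placeholder path):

require 'zip'

Zip::File.open('large.xlsx') do |zip|
  entry = zip.find_entry('xl/sharedStrings.xml')
  if entry
    puts "shared strings table present: #{entry.size} bytes uncompressed"
  else
    puts "no shared strings table; cell values are stored inline"
  end
end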

Closing: 1) due to inactivity, and 2) because I think that for files this large you should generally be using a simpler format like CSV, where you know you can read the file as a stream.
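
For comparison, Ruby's standard library can already stream a CSV row by row with flat memory use (the file name and the process call are placeholders):

require 'csv'

# Yields one row at a time; memory stays flat regardless of file size
CSV.foreach('large.csv', headers: true) do |row|
  process(row) # placeholder for whatever you do with each record
end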

Feel free to re-open with any new ideas.