jsonkenl/xlsxir

Extract cell range from ets table i.e "A3:F10", sheet names and named ranges

harmon25 opened this issue · 9 comments

Not sure if this can be achieved without first grabbing the entire ets table and filtering it.

Would be nice to just pass a string like "A10:G1000" and only get back a list/map/mda of those cols*rows out of ets.

Also workbook.xml contains some useful information when parsing workbooks such as sheet names, and definedNames which correspond to named ranges in excel.

Let me know if you are interested in adding this functionality to your library, I will try to help!

@harmon25 thanks for the idea. My initial thought is it's achievable...just need to think about it a bit. I'll sit down some this weekend and play around with it. Feel free to do the same and we can regroup on some ideas of how to implement. Thanks!

@jkennell Having dug a bit deeper into your implementation - believe we can make some serious improvements.

Have been using openpyxl to parse xlsx files from python - it is quite powerful and much faster than Xlsxir at parsing large worksheets - which makes me believe some optimizations could be made to drastically improve performance.

Firstly - does ETS actually provide any performance benefit?
I am not sure if there is overhead when writing/reading from ets, but considering all we are doing is storing the parsed xlsx file in memory(as an ets table) - why not just store it in a GenServer' state as a plain array or map?

Would probably consume about the same amount of RAM as it would in ets, without the potential overhead of ets...

A GenServer could also be used to contain Workbook state - similar to how state is maintained in openpyxl in python, an object oriented language.

Going to craft a POC tonight that takes a different approach than Xlsxir, by using a GenServer / supervisor to control what is happening to the XML stashed in RAM...

I would like to be able to access worksheets by name - currently this is not possible with the ets implementation...

@harmon25 Sorry, didn't realize I was logged into my other account when I responded. Yes, I think there is a ton of room for improvement. This was my "learn elixir" project and have been planning on refactoring it to make it more efficient but just haven't gotten to it yet (had a baby last year). I agree that a GenServer implementation is a better answer than the current ETS method, I had just learned how to use ETS tables before learning GenServer so it was the method I went with. Would much appreciate some help with the redesign if you're interested.

@kennellroxco congrats on the baby!

I am very much interested, a bit of an elixir novice myself - but can help with a few things - have some experience with sweet_xml as well which i believe will be faster for this size of xml files...

forked, and created a new branch future will be pushing some POC tonight - just got started hour or so ago!

future

Workbook is a supervisor, it can supervise the Worksheet processes and if some error happens in parsing of the sheet xml, could restart the Worksheet process...
That is what i have in mind anyway - also enables some parallel processing of the xml files - if they can be accessed from the zip handler in parallel - haven't tried yet...

Also with sweet_xml may be able to stream the parsing, into Flow?...

Bed time!

I wasn't able to use SweetXml because the streaming had very poor performance after about 5k elements. That's why I used Erlsom instead due to the sax parsing capabilities. I hadn't tried doing anything with Flow yet. My concern with using Flow for parsing is that it has to happen in order unless you can figure out a way to efficiently sort it afterwards.

Going to get into parsing worksheets tonight...
Gonna try parsing row by row, keeping track of row # should allow for sorting rows back into order.

Also possibly re-organize the supervision a bit... kinda want the app to be started with mix... currently does not show up as an application in :observer...

Sounds good. Please keep in mind you also need to ensure the cells within each row are parsed in order (i.e. [[A1, B1, C1],[A2, B2, C2]] ). I found that once you get into the double and triple letters (i.e. AA, AAA) sorting issues become more complex. Also need to account for empty or nil cells in between. For example, if cells A1 and C1 have data but cell B1 is completely empty, B1 will not show up in the xml file. Just FYI...some pains I went through the first time :)

@harmon25 closing this for now. Let me know if you run into issues with your implementation.