masaccio/numbers-parser

_NumbersModel.table_string_key degrades performance

Closed this issue · 3 comments

Thanks for maintaining the useful parser!

string_lookup = {x.string: x.key for x in self._table_strings[table_id]}

When handling a gigantic table, the line where generates a dictionary on every call extremely degrades the performance. Also the line key = max(string_lookup.values()) + 1 seems very expensive as well.

class TableStringsStore(object):
    def __init__(self, entries):
        self.entries = entries
        self.lookup = {x.string: x.key for x in entries}
        self.max_value = max(self.lookup.values()) if self.lookup else 0
    def append_entry(self, entry):
        self.entries.append(entry)
        self.lookup[entry.string] = entry.key
        self.max_value = max(self.max_value, entry.key)
    def clear_field_container(self):
        clear_field_container(self.entries)
        self.lookup = {}
        self.max_value = 0

This is just a sketch, but incrementally updating the string lookup dict and the max value via the wrapper like above did drastically improve the performance when saving a file. Annotating table_string_key() with @lru_cache also looked effective.

Many thanks for spotting that. I've refactored how I manage datalists which are used in various places and for the strings I should now be efficiently cacheing. There are other performance issues I am aware of in saving files but I've not done profiling on that yet.

Do keep the feedback coming - I'm very happy to see someone using the save functions.

You should see the improvements in 3.9.3.

I'm very looking forward to see the improvement and the upcoming release! Another low-hanging fruit I found was to utilize cache in iwafile.find_references (called from containers.ObjectStore.update_dirty_objects). For me the saving feature is fascinating because I think Numbers is a nice tool to present tabular data less cluttered than Excel. I appreciate your effort and keep up the good work!

Yes update_dirty_objects is not overly smart. It is also overly greedy in computing new dependencies rather than tracking what's actually dirty.