Lazy Load DF rows in entity_to_data and entity_to_cutoff dicts
RogerTangos opened this issue · 2 comments
trane.utils.df_group_by_entity_id
returns a dictionary where the entity id is the key and the value is a dataframe containing only the rows with that entity id.
This dict is then used by trane.utils.CutoffTimeBase.generate_cutoffs
to create a similar dictionary with the entity id as the key and a tuple as the value: (DF_OF_ROWS, entity_training_cutoff, entity_label_cutoff)
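As a minimal sketch of the current behavior (the column names and cutoff values below are made up for illustration; only the dict shapes mirror the issue):

```python
import pandas as pd

df = pd.DataFrame({
    "entity_id": [1, 1, 2],
    "value": [10.0, 12.0, 7.0],
})

# Roughly what df_group_by_entity_id does today: each value is a
# copy of that entity's rows.
entity_to_data = {eid: rows.copy() for eid, rows in df.groupby("entity_id")}

# generate_cutoffs then builds a similar dict whose values carry the
# copied rows alongside the two cutoff times.
cutoff = pd.Timestamp("2020-01-03")
entity_to_data_and_cutoff = {
    eid: (rows, cutoff, cutoff) for eid, rows in entity_to_data.items()
}
```

Every entity's rows exist twice here: once in the original dataframe and once in the dict values.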
These methods are currently the most significant bottlenecks in running Trane. A significant portion of the slowdown comes from trane copying the DF_OF_ROWS
into the dictionary. There's no reason for this. Trane should instead index the dataframe by entity_id and use the key to access the rows. Then the cutoff_dict
can be a dictionary of the form {entity_id_1: (entity_training_cutoff, entity_label_cutoff), entity_id_2: ...}
and won't have to hold a dataframe.
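The proposed scheme could look something like this (a sketch under assumed column names, not Trane's actual API):

```python
import pandas as pd

df = pd.DataFrame({
    "entity_id": [1, 1, 2],
    "value": [10.0, 12.0, 7.0],
})

# Index the dataframe by entity_id once; no per-entity copies are made.
indexed_df = df.set_index("entity_id").sort_index()

# The cutoff dict now holds only the two cutoff times per entity.
cutoff_dict = {
    1: (pd.Timestamp("2020-01-03"), pd.Timestamp("2020-01-06")),
    2: (pd.Timestamp("2020-01-03"), pd.Timestamp("2020-01-06")),
}

# Rows for an entity are fetched on demand through the index.
rows_for_entity_1 = indexed_df.loc[[1]]
```

Using `.loc[[entity_id]]` (with a list) keeps the result a dataframe even when an entity has a single row.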
The trane.utils.df_group_by_entity_id
method could then be eliminated entirely.
This will require searching through the code to refactor everything making use of the entity_to_data dict and entity_to_data_cutoff_dict. Then, there will need to be some mechanism for making sure that the DF is indexed by entity_id.
Ideally, this would allow for a custom entities_and_cutoff time object that could store a cutoff time data structure, a natural language description of the cutoff times, and possibly some other data.
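Such an object might be sketched as follows; the class and field names are hypothetical, drawn only from the wish list above:

```python
from dataclasses import dataclass

@dataclass
class EntitiesAndCutoffs:
    # {entity_id: (entity_training_cutoff, entity_label_cutoff)}
    cutoffs: dict
    # natural-language description of how the cutoff times were chosen
    description: str

    def cutoffs_for(self, entity_id):
        """Return the (training, label) cutoff pair for one entity."""
        return self.cutoffs[entity_id]

ec = EntitiesAndCutoffs(
    cutoffs={1: ("2020-01-03", "2020-01-06")},
    description="fixed window cutoffs for all entities",
)
```

Because it holds no dataframes, this object stays small regardless of how many rows each entity has.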
This is resolved in #20. It doesn't use lazy loading, but it is still much more efficient.