trane-dev/Trane

Lazy Load DF rows in entity_to_data and entity_to_cutoff dicts

RogerTangos opened this issue · 2 comments

trane.utils.df_group_by_entity_id returns a dictionary where the entity id is the key and the value is a dataframe containing only the rows for that entity id.

This dict is then used by trane.utils.CutoffTimeBase.generate_cutoffs to create a similar dictionary, keyed by entity id, whose value is a tuple: (DF_OF_ROWS, entity_training_cutoff, entity_label_cutoff).
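For reference, here is a minimal pandas sketch of the shapes described above. The column name entity_id, the sample data, and the cutoff values are illustrative only, not Trane's actual implementation:

```python
import pandas as pd

# Hypothetical input: one row per event, tagged with an entity id.
df = pd.DataFrame({
    "entity_id": [1, 1, 2],
    "value": [10, 20, 30],
})

# Rough equivalent of trane.utils.df_group_by_entity_id: each entity id
# maps to a copy of that entity's rows.
entity_to_data = {eid: group.copy() for eid, group in df.groupby("entity_id")}

# Rough shape of the dict built by generate_cutoffs: the per-entity
# dataframe is carried along with the two cutoff times.
entity_to_data_and_cutoff = {
    eid: (rows, "2017-01-01", "2017-06-01")  # (DF_OF_ROWS, training_cutoff, label_cutoff)
    for eid, rows in entity_to_data.items()
}
```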

These methods are currently the most significant bottlenecks when running Trane. Much of the slowdown comes from copying DF_OF_ROWS into the dictionary, and there's no reason for it. Trane should instead index the dataframe by entity_id and use the key to look up each entity's rows. The cutoff dict can then have the form {entity_id_1: (entity_training_cutoff, entity_label_cutoff), entity_id_2: ...} and won't have to hold a dataframe at all.
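A minimal sketch of the proposed approach, assuming a pandas DataFrame indexed by entity_id (the column name, sample data, and cutoff values are again illustrative):

```python
import pandas as pd

# Single shared dataframe, indexed by entity id instead of grouped into copies.
df = pd.DataFrame({
    "entity_id": [1, 1, 2],
    "value": [10, 20, 30],
}).set_index("entity_id")

# Cutoffs only; no per-entity dataframe copies are stored.
entity_to_cutoff = {
    1: ("2017-01-01", "2017-06-01"),
    2: ("2017-01-01", "2017-06-01"),
}

# Rows for an entity are pulled from the shared, indexed dataframe on demand.
for entity_id, (training_cutoff, label_cutoff) in entity_to_cutoff.items():
    rows = df.loc[[entity_id]]  # list indexer keeps a DataFrame even for a single row
    print(entity_id, len(rows), training_cutoff, label_cutoff)
```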

The trane.utils.df_group_by_entity_id method could then be eliminated entirely.

This will require searching through the code and refactoring everything that makes use of the entity_to_data dict and the entity_to_data_cutoff_dict. There will also need to be some mechanism for making sure that the DF is indexed by entity_id.
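One possible mechanism, sketched here as a hypothetical helper (ensure_entity_index is not an existing Trane function):

```python
import pandas as pd

def ensure_entity_index(df: pd.DataFrame, entity_id_col: str = "entity_id") -> pd.DataFrame:
    """Return df indexed by the entity id column, setting the index only if needed."""
    if df.index.name == entity_id_col:
        return df
    if entity_id_col not in df.columns:
        raise KeyError(f"{entity_id_col!r} is neither the index nor a column of the dataframe")
    return df.set_index(entity_id_col)
```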

Ideally, this would also allow for a custom entities-and-cutoff-times object, which could store the cutoff time data structure, a natural language description of the cutoff times, and possibly some other data.
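As a sketch only, such an object could be a small dataclass; the name EntityCutoffs and its fields are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple

@dataclass
class EntityCutoffs:
    """Hypothetical container pairing per-entity cutoff times with a
    human-readable description of how they were chosen."""
    cutoffs: Dict[Any, Tuple[Any, Any]]  # entity_id -> (training_cutoff, label_cutoff)
    description: str = ""                # natural language description of the cutoff times
    metadata: dict = field(default_factory=dict)  # any other data worth carrying along

    def __getitem__(self, entity_id):
        return self.cutoffs[entity_id]
```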

This is resolved in #20. It doesn't use lazy loading, but is still much more efficient.