wboler05/pso_neural_net

Data Cache


Currently, all the labelled data is loaded directly into memory. This would be impossible for extremely large datasets. To alleviate this, a data cache should be designed that loads chunks of data from a file. There should be a class that holds the sections of data, as well as a class that represents a single section. Data access will first go through the cache's specified section; if the requested index is not found within the loaded index range, the section will load a new chunk of data from the file.
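
Roughly, the relationship between those two classes might look like the sketch below. Everything other than BigDataCache and OutageDataItem (which come from this issue) is hypothetical, and the members of OutageDataItem are placeholders for the UML further down:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for the input data class shown in the UML further down; the
// real members come from that diagram (plus the labelled outputs added
// in a later comment). Fixed-size here so records can be read straight
// from a binary file.
struct OutageDataItem {
    double inputs[16];
    double labelledOutput;
};

// One Chunk is a consecutive section of the dataset.
// Members are left public for brevity in this sketch.
struct Chunk {
    bool contains(std::size_t globalIndex) const {
        return globalIndex >= _firstIndex &&
               globalIndex <  _firstIndex + _items.size();
    }
    const OutageDataItem & at(std::size_t globalIndex) const {
        return _items[globalIndex - _firstIndex];
    }
    std::size_t _firstIndex = 0;            // file index of _items[0]
    std::vector<OutageDataItem> _items;     // the loaded section of data
};

// The cache owns M chunks and reloads them from file on demand.
class BigDataCache {
public:
    OutageDataItem get(std::size_t globalIndex);  // sketched in a later comment
private:
    std::string _fileName;
    std::size_t _itemsPerChunk = 0;
    std::vector<Chunk> _chunks;                   // M chunks
};
```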

Once we get to the point of creating the wrapper class for the trainer, the next step will be to load the data in a cache. To further illustrate this concept, I've provided a diagram depicting how it might be set up. Data should be loaded from the file in caches based on a defined memory limit. The BigDataCache class should have a _maxAllowableMemory constant or variable, exposed as a static method (a small sketch follows the diagram).

[diagram: BigDataCache loading chunks of data from file]
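
For instance, the budget could be exposed like this (the 256 MiB default is an arbitrary placeholder, not a value from the project):

```cpp
#include <cstddef>

// Hypothetical addition to the BigDataCache sketch above: the memory
// budget as a static constant, exposed through a static method.
class BigDataCache {
public:
    static std::size_t maxAllowableMemory() { return _maxAllowableMemory; }
    // ...
private:
    static constexpr std::size_t _maxAllowableMemory =
        256ull * 1024 * 1024;   // 256 MiB placeholder budget
    // ...
};
```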

The class will further provide the chunk size allocation based on the size of the input data. Below is an example of the input data class UML (not the BigDataCache UML). These objects define the granularity of the cache chunks: there will be X OutageDataItems per Chunk, and M Chunks in the Cache. The chunk count M and the _maxAllowableMemory of the cache will be modifiable by the developer, whereas the number of input items X per chunk will be calculated from those settings (see the sketch after the UML).

[UML: OutageDataItem input data class]
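
The calculation of X is just a division chain; here is a minimal sketch, with the per-item size taken as a parameter so it stands alone:

```cpp
#include <cstddef>

// Hypothetical derivation of X, the number of input items per chunk,
// from the two developer-tunable settings named above.
std::size_t itemsPerChunk(std::size_t maxAllowableMemory,
                          std::size_t numChunksM,
                          std::size_t bytesPerItem)
{
    std::size_t bytesPerChunk = maxAllowableMemory / numChunksM;
    return bytesPerChunk / bytesPerItem;   // X
}
// Example: a 256 MiB budget, M = 8 chunks, 1 KiB items
// -> 32 MiB per chunk -> X = 32768 items per chunk.
```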

From this point, accessing and loading will be straightforward. The input data will be read-only, so there will be no problem maintaining consistency of the dataset. The project will try to get the data object at a specific index from the cache; if that index is not loaded, the cache will automatically clear the appropriate cache chunk, open the file, load the new object index range into memory, and close the file. It handles everything internally, making access to a data object a simple request to the cache in shared memory. There will be a slowdown from reading the file and from passing full input objects (rather than references) out of the cache, but this performance can be analyzed and tuned through M.
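
Continuing the sketch from the first comment, and assuming fixed-size binary records, the miss path might look like the following. Mapping each chunk number to a fixed slot is an assumed eviction policy for illustration, not the project's actual policy:

```cpp
#include <fstream>

// On a miss, the slot for the requested index range is cleared, the
// file is opened, the new range is read into memory, and the file
// closes when it leaves scope. Items are returned by value, matching
// the note about passing full objects rather than references.
OutageDataItem BigDataCache::get(std::size_t globalIndex) {
    for (const Chunk & c : _chunks) {
        if (c.contains(globalIndex)) {
            return c.at(globalIndex);             // hit: served from memory
        }
    }
    // Miss: map the index to a chunk slot and reload that slot from file.
    std::size_t chunkNumber = globalIndex / _itemsPerChunk;
    Chunk & slot = _chunks[chunkNumber % _chunks.size()];
    slot._firstIndex = chunkNumber * _itemsPerChunk;
    slot._items.assign(_itemsPerChunk, OutageDataItem{});  // evict old section
    std::ifstream file(_fileName, std::ios::binary);
    file.seekg(static_cast<std::streamoff>(
        slot._firstIndex * sizeof(OutageDataItem)));
    file.read(reinterpret_cast<char *>(slot._items.data()),
              static_cast<std::streamsize>(
                  _itemsPerChunk * sizeof(OutageDataItem)));
    // (A real implementation would clamp the final chunk at end-of-file
    //  and check the stream state.)
    return slot.at(globalIndex);
}
```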

Also, in the prior post, I forgot to include the labelled outputs among the members of the OutageDataItem.

Currently being implemented via Issue15.

Data caching works. Caches are split into slices, each holding a fixed number of items. Each slice has a group id that indicates where in the file the slice originated. Multiple slices can come from the same group, but each slice within the group occupies one section of the cache (consecutive slices). If a requested index does not match the group number of the resident slice, the correct slice is read from file; otherwise, the requested index is served straight from the cache.
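
A rough shape of that group check, with hypothetical names (Slice, groupId, SlicedCache) since the actual identifiers from Issue15 aren't shown here:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of the slice bookkeeping described above. The
// group id ties a slice back to its origin in the file; a mismatch
// between the requested index's group and the resident slice's group
// is what triggers the read from file.
struct Slice {
    std::size_t groupId = 0;   // which file region this slice holds
    // ... the items themselves ...
};

class SlicedCache {
public:
    bool isResident(std::size_t index) const {
        std::size_t group = index / _itemsPerSlice;         // group of the index
        const Slice & s = _slices[group % _slices.size()];  // its cache section
        return s.groupId == group;                          // mismatch => reload
    }
private:
    std::size_t _itemsPerSlice = 1024;
    std::vector<Slice> _slices;
};
```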

A correction still needs to be made to prevent passing a reference to a local variable, but it will require consideration of how to represent the memory of the stored location.