red-data-tools/red-datasets

Large data sets

bkmgit opened this issue · 2 comments

It may be good to have a different way to work with large datasets. For example, the https://ldbcouncil.org/benchmarks/graphalytics/ datasets are 1.1 TB in total.

kou commented

Do you have any idea?

What should we care about? Local storage size? Download time? ...?

  1. Add a threshold above which data is streamed from disk instead of being read entirely into memory. Use a default such as 500 MB that the user can adjust.
  2. Local storage may be an issue; perhaps ask the user whether they want to proceed, giving an estimate of the storage space required.
  3. For download time, not much can be done, though on Linux wget -c is helpful for resuming an incomplete download without starting again. If the data is stored in the cloud in a suitable form, one can stream just the portion of interest, but this requires supporting infrastructure and is perhaps a step for the future. At present, consider datasets up to 100 GB, which can be analyzed on a workstation.
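The threshold idea in point 1 could be sketched in Ruby roughly as follows. `each_record` and `STREAMING_THRESHOLD` are illustrative names, not part of Red Datasets' current API; this is a minimal sketch assuming line-oriented records.

```ruby
# Hypothetical helper: stream records lazily when the file exceeds a
# size threshold, otherwise load the whole file into memory at once.
STREAMING_THRESHOLD = 500 * 1024 * 1024  # 500 MB default, user adjustable

def each_record(path, threshold: STREAMING_THRESHOLD)
  return to_enum(__method__, path, threshold: threshold) unless block_given?
  if File.size(path) > threshold
    # Large file: yield one line at a time without reading it all
    File.foreach(path) do |line|
      yield line.chomp
    end
  else
    # Small file: read everything at once, then yield each line
    File.read(path).each_line do |line|
      yield line.chomp
    end
  end
end
```

Either branch yields the same records, so callers are unaffected by which path is taken; only memory usage differs.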
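For point 2, a confirmation prompt might look like the sketch below. `format_bytes` and `confirm_download?` are hypothetical helper names; the injectable `input`/`output` streams are just there to keep the sketch testable.

```ruby
# Hypothetical helper: human-readable size for the storage estimate.
def format_bytes(bytes)
  units = %w[B KB MB GB TB]
  unit = 0
  size = bytes.to_f
  while size >= 1024 && unit < units.size - 1
    size /= 1024
    unit += 1
  end
  format("%.1f %s", size, units[unit])
end

# Hypothetical prompt: show the estimated storage requirement and ask
# the user whether to proceed before starting a large download.
def confirm_download?(estimated_bytes, input: $stdin, output: $stdout)
  output.puts("This dataset needs about #{format_bytes(estimated_bytes)} of local storage.")
  output.print("Proceed? [y/N] ")
  input.gets.to_s.strip.downcase.start_with?("y")
end
```

Defaulting to "no" means an accidental Enter never kicks off a terabyte-scale download.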