red-data-tools/red-datasets

Improve loading/parsing speed in 'arrowable' environment

heronshoes opened this issue · 2 comments

It takes a long time to read a large dataset from a source for the first time.

I created a fresh Docker environment for my dataframe example and found that pulling a large dataset such as nycflights13 is very time-consuming.

If you use red-datasets-arrow, the cache is stored as an Arrow file, but the first load still takes a long time because the source is loaded and parsed with Ruby's CSV.

Is it possible to make an environment extended with red-datasets-arrow use Arrow to load and parse?

kou commented

How about adding Datasets::CSVParser like Datasets::ZipExtractor and extending Datasets::CSVParser in red-datasets-arrow?
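A minimal sketch of how that suggestion could look, assuming a parser class that wraps Ruby's CSV and a module that swaps in Arrow's CSV reader when red-arrow is loaded. The `each_row` method, the `DatasetsArrowSketch` module, and the use of `prepend` are all hypothetical and only illustrate the idea; they are not the actual red-datasets or red-datasets-arrow API.

```ruby
require "csv"

module Datasets
  # Hypothetical parser class; the real interface may differ.
  class CSVParser
    def initialize(path)
      @path = path
    end

    # Yields each data row as a Hash keyed by header name,
    # parsed with Ruby's CSV library.
    def each_row
      return to_enum(__method__) unless block_given?

      CSV.foreach(@path, headers: true) do |row|
        yield(row.to_h)
      end
    end
  end
end

# Hypothetical extension that red-datasets-arrow could provide.
module DatasetsArrowSketch
  module CSVParser
    # Same interface, but the file is read by Arrow's CSV reader.
    def each_row
      return to_enum(__method__) unless block_given?

      table = Arrow::Table.load(@path.to_s, format: :csv)
      names = table.schema.fields.map(&:name)
      table.raw_records.each do |values|
        yield(names.zip(values).to_h)
      end
    end
  end
end

# Swap in the Arrow-backed reader only when red-arrow is already loaded.
if defined?(Arrow::Table)
  Datasets::CSVParser.prepend(DatasetsArrowSketch::CSVParser)
end
```

In this sketch the two readers are not fully interchangeable: Ruby's CSV returns every value as a String unless converters are given, while Arrow infers column types, so a dataset class using such a parser would need to handle both.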

Thanks @kou.

I will try adding Datasets::CSVParser first!