Improve loading/parsing speed in 'arrowable' environment

Question


heronshoes opened this issue 2 years ago · 2 comments

It takes a long time to read a large dataset from a source for the first time.

I created a fresh Docker environment for my dataframe example and found that pulling a large dataset such as nycflights13 is very time-consuming.

With red-datasets-arrow, the parsed data is cached as an Arrow file, so later loads are fast. The first load is still slow, though, because the download is parsed with Ruby's CSV library.

Is it possible for an environment extended with red-datasets-arrow to use Arrow to load and parse the CSV instead?

Answer 1 · 2023-04-01T13:49:55.000Z

How about adding Datasets::CSVParser like Datasets::ZipExtractor and extending Datasets::CSVParser in red-datasets-arrow?
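
A minimal sketch of how that could look, assuming a parser class with a single overridable `parse` hook. Everything under `Sketch` is hypothetical: `Datasets::CSVParser` does not exist yet, and the names `each_row`, `parse`, and `ArrowCSVParser` are made up for illustration. The Arrow override assumes red-arrow's `Arrow::Table.load(path, format: :csv)` and is not exercised here, so only the default Ruby CSV path runs.

```ruby
require "csv"
require "tempfile"

module Sketch
  # Hypothetical default parser: a thin wrapper over Ruby's CSV library.
  class CSVParser
    def initialize(path)
      @path = path
    end

    # Yields each row as a Hash keyed by header name.
    # Returns an Enumerator when no block is given.
    def each_row
      return to_enum(__method__) unless block_given?
      parse.each do |row|
        yield(row)
      end
    end

    private

    # Extension point: returns an Enumerable of row Hashes.
    # With Ruby's CSV, every value is a String.
    def parse
      CSV.foreach(@path, headers: true).lazy.map(&:to_h)
    end
  end

  # Hypothetical override that red-datasets-arrow could apply with
  #   Sketch::CSVParser.prepend(Sketch::ArrowCSVParser)
  # Assumption: red-arrow is loaded. Arrow infers column types, so
  # values may be Integer, Float, etc. rather than String.
  module ArrowCSVParser
    private

    def parse
      table = Arrow::Table.load(@path, format: :csv)
      table.each_record.lazy.map(&:to_h)
    end
  end
end

# Usage with the default (Ruby CSV) parser on a small temporary file.
file = Tempfile.new(["flights", ".csv"])
file.write("carrier,distance\nUA,1400\nAA,1089\n")
file.close

rows = Sketch::CSVParser.new(file.path).each_row.to_a
rows.each { |row| puts row.inspect }
```

Keeping the override in a separate module means the base gem would not need to depend on Arrow. Datasets that parse through the hook would pick up the faster path only when the extension is loaded.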

Answer 2 · 2023-04-02T01:44:20.000Z

Thanks, @kou.

I will try adding Datasets::CSVParser first!