ltelab/disdrodb

Using Arrow to further speed up raw data I/O

ghiggi opened this issue · 0 comments

Prework

  • Read and agree to the code of conduct.
  • If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • Post a minimal reproducible example so the maintainer can troubleshoot the problems you identify. A reproducible example is:
      • Runnable
      • Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.

Description

Evaluate the benefits of using:

  • engine="pyarrow" in pd.read_csv, to read the raw data using multithreading,
  • the PyArrow dtype backend (dtype_backend="pyarrow") introduced in pandas 2.0, to reduce the memory usage of string columns in a pd.DataFrame.
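
A minimal sketch of how the two options above could be compared, assuming pandas >= 2.0 with the optional pyarrow package installed. The synthetic file, its column names, and the helpers make_raw_csv and read_raw are hypothetical stand-ins for illustration and are not part of DISDRODB.

```python
import io

import pandas as pd

try:
    import pyarrow  # noqa: F401  (optional dependency)
    HAS_PYARROW = True
except ImportError:
    HAS_PYARROW = False


def make_raw_csv(n_rows=1000):
    """Build a synthetic raw file (hypothetical columns, not a real DISDRODB format)."""
    header = "time,sensor_status,raw_spectrum"
    rows = [
        f"2023-01-01 00:00:{i % 60:02d},OK,{'000;' * 32}"
        for i in range(n_rows)
    ]
    return "\n".join([header, *rows]) + "\n"


def read_raw(text, use_arrow):
    """Read the CSV text with either the default or the pyarrow engine and dtype backend."""
    buffer = io.BytesIO(text.encode("utf-8"))
    if use_arrow:
        # The pyarrow engine parses the file with multiple threads;
        # dtype_backend="pyarrow" stores the columns as Arrow-backed dtypes.
        return pd.read_csv(buffer, engine="pyarrow", dtype_backend="pyarrow")
    return pd.read_csv(buffer)


text = make_raw_csv()
df_default = read_raw(text, use_arrow=False)
print("default:", df_default.memory_usage(deep=True).sum(), "bytes")
if HAS_PYARROW:
    df_arrow = read_raw(text, use_arrow=True)
    print("pyarrow:", df_arrow.memory_usage(deep=True).sum(), "bytes")
```

The memory figures depend on the pandas version and on the data, so a meaningful evaluation would run the same comparison (plus wall-clock timing, for example with time.perf_counter) on real raw files.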

Please describe the performance issue.

Benchmarks

How poorly does DISDRODB perform?