Using Arrow to further speed up raw data I/O
ghiggi opened this issue · 0 comments
Prework
- Read and agree to the code of conduct.
- If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
- Post a minimal reproducible example so the maintainer can troubleshoot the problems you identify. A reproducible example is:
- Runnable
- Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
Description
Evaluate the benefits of using:
- the `engine="pyarrow"` option in `pd.read_csv` to read the raw data using multithreading,
- the `pyarrow` dtype backend (`dtype_backend="pyarrow"`) introduced in pandas 2.0 to decrease the memory usage of string columns in `pd.DataFrame`.
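A minimal sketch of how the two options could be compared is given below. It is not DISDRODB code: the synthetic file, its column names, and the helpers `make_sample_csv` and `benchmark_read` are hypothetical and only illustrate the idea. It assumes pandas >= 2.0, and the Arrow configuration is skipped when the optional `pyarrow` package is not installed.

```python
import importlib.util
import os
import tempfile
import time

import pandas as pd

# pyarrow is an optional pandas dependency; skip that configuration if absent.
HAS_PYARROW = importlib.util.find_spec("pyarrow") is not None


def make_sample_csv(path, n_rows=1000):
    """Write a synthetic CSV file (hypothetical columns, not real raw data)."""
    df = pd.DataFrame(
        {
            "station": [f"station_{i % 5}" for i in range(n_rows)],
            "raw_record": [f"000.123;045;{i}" for i in range(n_rows)],
            "value": [i * 0.5 for i in range(n_rows)],
        }
    )
    df.to_csv(path, index=False)


def benchmark_read(path):
    """Time pd.read_csv and measure DataFrame memory for each configuration."""
    configs = {"default": {}}
    if HAS_PYARROW:
        # Multithreaded Arrow parser plus Arrow-backed dtypes (pandas >= 2.0).
        configs["pyarrow"] = {"engine": "pyarrow", "dtype_backend": "pyarrow"}

    results = {}
    for label, kwargs in configs.items():
        start = time.perf_counter()
        df = pd.read_csv(path, **kwargs)
        elapsed = time.perf_counter() - start
        results[label] = {
            "seconds": elapsed,
            # deep=True counts the actual size of string values.
            "bytes": int(df.memory_usage(deep=True).sum()),
            "rows": len(df),
        }
    return results


with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    make_sample_csv(sample_path)
    results = benchmark_read(sample_path)

for label, stats in results.items():
    print(label, stats)
```

On a file this small the timings are dominated by overhead, so a meaningful comparison would need real raw files of representative size and several repetitions per configuration.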
Please describe the performance issue.
Benchmarks
How poorly does DISDRODB perform?