scicloj/tablecloth

Make a tool to split arbitrarily big CSV files into directories of Arrow files


connected to #6

I would like to make a tool which takes an arbitrarily large CSV file,
splits it into any number of partitions (using 1 to n columns as "groups"),
creates a directory hierarchy accordingly, and writes an Arrow file for each group.

Having that and #6 implemented, we could work with very big CSV files.

I think this can be done "rather easily" in a few lines of R code.

  1. Split the large CSV file into many CSV files on disk

    • readr::read_csv_chunked combined with
    • write_csv (append = TRUE)
      should allow us to go over the CSV file chunk by chunk and append the pieces, group by group, to the target CSV files.
      This should hardly use any memory (it is bounded by the chunk size); see the first sketch after this list.

  2. Convert the "hierarchy of CSV files" into a "hierarchy of Arrow files"
      The arrow package has a way to do this conversion directly.
      This should work, hopefully; see the second sketch after this list.

  3. Delete the temporary CSV files
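Below is a minimal sketch of step 1 in R. The input file, output directory, and grouping column are made-up placeholders, and it assumes the group values are safe to use in file names; it is only an illustration of the chunked read-and-append idea, not the actual tool.

```r
library(readr)

## Hypothetical inputs -- adjust to your data.
input_csv  <- "big.csv"          # the arbitrarily large CSV file
out_dir    <- "csv-partitions"   # per-group CSV files are written here
group_cols <- c("country")       # 1..n columns used as "groups"

dir.create(out_dir, showWarnings = FALSE)

## Called once per chunk: appends each group's rows to that group's CSV file.
append_chunk <- function(chunk, pos) {
  keys <- interaction(chunk[group_cols], drop = TRUE, sep = "_")
  for (piece in split(chunk, keys)) {
    key  <- paste(unlist(piece[1, group_cols]), collapse = "_")
    file <- file.path(out_dir, paste0(key, ".csv"))
    ## col_names defaults to !append, so the header is written only once.
    write_csv(piece, file, append = file.exists(file))
  }
}

## Memory use is bounded by chunk_size, not by the size of the input file.
read_csv_chunked(input_csv, SideEffectChunkCallback$new(append_chunk),
                 chunk_size = 100000)
```

For step 2, a sketch using the arrow package's dataset API (the output directory name and the choice of the Feather/IPC format are assumptions):

```r
library(arrow)

## Open the directory of per-group CSVs lazily and re-write it as a
## hive-style directory hierarchy of Arrow (Feather/IPC) files,
## one sub-directory per group.
ds <- open_dataset(out_dir, format = "csv")
write_dataset(ds, "arrow-partitions",
              format       = "feather",
              partitioning = group_cols)
```

Step 3 is then just unlink(out_dir, recursive = TRUE) to remove the temporary CSVs.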

The CSV parser of readr::read_csv_chunked is, in my view, very good and very fast.
So we can leave the CSV parsing to readr.

TMD doesn't support writing Arrow currently, so the converter idea can be addressed with clojisr. I agree that your procedure should be straightforward.

Maybe we can close this for now.
I made a tool which can partition CSVs accordingly:
https://github.com/behrica/csvsplit

In the last few weeks, geni and TMD got new features for big-data formats (Arrow, Parquet).
I think we are "done" with this.

Clojure is ready for analysing big data.

Geni - as large as your cluster can scale
TMD - big, as long as it fits in RAM