scicloj/tablecloth

Make a tool to split arbitrarily big CSV files into directories of Arrow files


connected to #6

I would like to make a tool which takes an arbitrarily large CSV file,
splits it into any number of partitions (using 1 to n columns as "groups"),
creates a directory hierarchy accordingly, and writes an Arrow file for each group.

Having that and #6 implemented, we could work with very big CSV files.

I think this can be done "rather easily" in a few lines of R code.

  1. Split the large CSV file into many CSV files on disk

    • readr::read_csv_chunked combined with
    • write_csv (append = TRUE)
      should allow us to go over the CSV file chunk by chunk and append the pieces, group by group, to the target CSV files.
      This should hardly use any memory (it is bounded by the chunk size); see the first sketch after this list.

  2. Convert the "hierarchy of CSV files" into a "hierarchy of Arrow files"
      The arrow package has a way to do this conversion directly.
      This should work, hopefully; see the second sketch after this list.

  3. Delete the temporary CSV files
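Below is a minimal sketch of step 1 in R. The input file, output directory, and grouping column are made-up placeholders, and it assumes the group values are safe to use in file names; it is only an illustration of the chunked read-and-append idea, not the actual tool.

```r
library(readr)

## Hypothetical inputs -- adjust to your data.
input_csv  <- "big.csv"          # the arbitrarily large CSV file
out_dir    <- "csv-partitions"   # per-group CSV files are written here
group_cols <- c("country")       # 1..n columns used as "groups"

dir.create(out_dir, showWarnings = FALSE)

## Called once per chunk: appends each group's rows to that group's CSV file.
append_chunk <- function(chunk, pos) {
  keys <- interaction(chunk[group_cols], drop = TRUE, sep = "_")
  for (piece in split(chunk, keys)) {
    key  <- paste(unlist(piece[1, group_cols]), collapse = "_")
    file <- file.path(out_dir, paste0(key, ".csv"))
    ## col_names defaults to !append, so the header is written only once.
    write_csv(piece, file, append = file.exists(file))
  }
}

## Memory use is bounded by chunk_size, not by the size of the input file.
read_csv_chunked(input_csv, SideEffectChunkCallback$new(append_chunk),
                 chunk_size = 100000)
```

For step 2, a sketch using the arrow package's dataset API (the output directory name and the choice of the Feather/IPC format are assumptions):

```r
library(arrow)

## Open the directory of per-group CSVs lazily and re-write it as a
## hive-style directory hierarchy of Arrow (Feather/IPC) files,
## one sub-directory per group.
ds <- open_dataset(out_dir, format = "csv")
write_dataset(ds, "arrow-partitions",
              format       = "feather",
              partitioning = group_cols)
```

Step 3 is then just unlink(out_dir, recursive = TRUE) to remove the temporary CSVs.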

The CSV parser of readr::read_csv_chunked is, in my view, very good and very fast.
So we can leave the CSV parsing to readr.

TMD doesn't support writing Arrow currently, so the converter idea can be addressed with clojisr. I agree that your procedure should be straightforward.

Maybe we can close this for now.
I made a tool which can partition CSVs accordingly:
https://github.com/behrica/csvsplit

In the last few weeks, geni and TMD got new features for big-data formats (Arrow, Parquet).
I think we are "done" with this.

Clojure is ready for analysing big data.

Geni - as large as your cluster can scale
TMD - big, as long as it fits in RAM