Auerbach-Lab/Behavior-autoanalysis

Archive data format / Speed of reads and writes

Closed this issue · 2 comments

tofof commented

Rdata files are already too slow to read and write the trials archive.

Possible alternatives include CSVs read with fread or vroom, which also allow appending writes, or the cross-platform Feather / Apache Arrow format.

tofof commented

All tests were to files on a mirrored hardware raid of 'spinning rust' HDDs.

Scientific notation:

  • Rdata doesn't recognize scientific notation present in some archives, and treats columns with 'e-05' in the numerical data as character strings.
  • Vroom and Fread both are aware of scientific notation and correctly process these values as numeric.
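
A minimal sketch of the fread side of that check. The file contents and column names here are made up for illustration; they are not taken from the real archives.

```r
library(data.table)

# Hypothetical sample: a small CSV whose numeric column uses scientific notation.
path <- tempfile(fileext = ".csv")
writeLines(c("trial,reaction", "1,2.5e-05", "2,1e-05", "3,0.004"), path)

trials <- fread(path)
str(trials)

# The column should come back as numeric rather than as character strings.
stopifnot(is.numeric(trials$reaction))
```

If a column still arrives as character, passing colClasses to fread is one way to force the type.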

Size:

  • Vroom_write and Fwrite produce effectively identically sized csv/tsv files.
  • Rdata ascii produces worse filesize than csv/tsv.
  • Rdata compressed binary produces a smaller file than csv/tsv.

Speed (round-trip):

  • Rdata save/load as uncompressed ascii (currently in use) is ~36.6s for a 159 MB file
  • Rdata save/load with default arguments (compressed binary) on fmr1 is ~4.4s for a 21 MB file
  • Rdata save/load as uncompressed binary is ~1.3s for a 190 MB file
  • Fwrite/fread provided by data.table is ~1.6s for a 123 MB file
  • Fwrite/fread appending is instantaneous (load is then 0.2s); fastest append round trip
  • Fwrite/fread using compression to gz is ~1.4s for a 26 MB file; best option for compressed data
  • Fwrite/fread appending and compressing is instantaneous (load is then 0.6s)
  • Vroom is ~0.6s for a 123 MB file; fastest whole round trip
  • Vroom with altrep mode off is 0.9s
  • Vroom appending is instantaneous (load is then 0.3s)
  • readr's versions are known to be slower than fread
  • read.csv is known to be slower still (slower than readr)
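
A rough sketch of how the data.table timings above could be reproduced. The `archive` table is a hypothetical stand-in, not the real trials archive, so absolute numbers will differ with data and disk. Reading the .gz file back directly assumes the R.utils package is installed, so that step is guarded.

```r
library(data.table)

# Hypothetical stand-in for the trials archive; the real columns are not shown here.
set.seed(1)
n <- 1e5
archive <- data.table(
  trial    = seq_len(n),
  stimulus = sample(c("tone", "noise"), n, replace = TRUE),
  reaction = runif(n)
)

csv_path <- tempfile(fileext = ".csv")
gz_path  <- tempfile(fileext = ".csv.gz")

# Plain CSV round trip: write everything, then read it all back.
t_csv <- system.time({
  fwrite(archive, csv_path)
  from_csv <- fread(csv_path)
})

# Compressed write; fwrite picks gzip from the .gz extension.
t_gz_write <- system.time(fwrite(archive, gz_path))

# fread needs R.utils to read .gz files directly, so only time it if available.
if (requireNamespace("R.utils", quietly = TRUE)) {
  t_gz_read <- system.time(from_gz <- fread(gz_path))
  print(t_gz_read[["elapsed"]])
}

# Appending write: only the new rows are written, no header is repeated.
new_rows <- archive[1:100]
t_append <- system.time(fwrite(new_rows, csv_path, append = TRUE))
after_append <- fread(csv_path)

print(c(
  csv      = t_csv[["elapsed"]],
  gz_write = t_gz_write[["elapsed"]],
  append   = t_append[["elapsed"]]
))
```

The append case is where the archive benefits most, since each session only writes its new rows instead of rewriting the whole file.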

Filesystem:

  • fwrite append gives a helpful warning if the appended data is not the same length as the existing file
  • vroom leaves a handle attached to a file it has read, which makes subsequent writes to that file difficult, but this only affects non-append writes. Furthermore, this issue can be avoided entirely by skipping vroom's index-and-lazyload behavior, i.e. doing the read with altrep = FALSE.
  • vroom silently drops list columns and other complicated data structures seen in run_archive while writing to disk, with no warning that it didn't write the entire expected contents
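
A small sketch of the altrep = FALSE workaround mentioned above, assuming the vroom package is installed. The file and its columns are hypothetical.

```r
library(vroom)

# Hypothetical file, written with vroom_write's default tab delimiter.
path <- tempfile(fileext = ".tsv")
vroom_write(data.frame(trial = 1:3, reaction = c(0.21, 0.35, 0.18)), path)

# altrep = FALSE reads every column eagerly instead of lazily indexing the file.
trials <- vroom(path, altrep = FALSE, show_col_types = FALSE)

# A non-append write back to the same path (convert seconds to milliseconds first).
trials$reaction <- trials$reaction * 1000
vroom_write(trials, path)

reread <- vroom(path, altrep = FALSE, show_col_types = FALSE)
print(reread)
```

This trades vroom's lazy-loading speed for a plain eager read, which matches the slower "altrep mode off" timing in the list above.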

vroom silently drops data when writing if it's too complicated (no warning, no error)

(said again for emphasis)

tofof commented

Winner: data.table's fread/fwrite.

  • More mature and fewer concerning issues in the package's GitHub tracker (e.g. silently truncating if mismatched quotes are in the data).
  • Supports compression, which vroom lacks.
  • Compressed fwrite/fread outperforms compressed binary Rdata on speed (~1.4s vs ~4.4s) and is comparable on size (26 MB vs 21 MB), and it supports appending to compressed files
  • Warns on shape of data
  • Does not silently drop data when it encounters e.g. list structures inside the data, instead erroring -- and better still, the sep2 argument is intended to support list structures and list columns (implementation progress unclear)
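
A minimal sketch of what sep2 does on the write side for a list column. The `runs` table is hypothetical and much simpler than run_archive. Splitting the column again after reading is a manual step assumed here, not something fread does automatically.

```r
library(data.table)

# Hypothetical table with a list column, loosely in the spirit of run_archive.
runs <- data.table(
  run_id = 1:2,
  scores = list(c(0.5, 0.75), c(1, 2, 3))
)

path <- tempfile(fileext = ".csv")

# sep2[2] is the separator placed between items inside each list cell.
fwrite(runs, path, sep2 = c("", "|", ""))
lines <- readLines(path)
print(lines)

# Reading back gives a character column; split it manually to restore the list.
back <- fread(path, sep = ",")
scores <- lapply(strsplit(back$scores, "|", fixed = TRUE), as.numeric)
print(scores)
```

This only covers list cells that hold plain atomic vectors. More deeply nested structures would need a different representation.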