billpwchan/futu_algo

Use Parquet instead of CSV for better performance

billpwchan opened this issue · 2 comments

  1. Refactor the code so all I/O-related operations are encapsulated in data_engine
  2. Convert CSV to Parquet with a 2X saving in file size...Amazing
  3. Create a utility function for people who've already downloaded historical data in CSV format, so they can easily convert all their CSV files to Parquet (see the sketch after this list)
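A minimal sketch of what such a conversion helper could look like, assuming pandas with a Parquet engine such as pyarrow is installed. The function name and paths below are placeholders, not the actual data_engine API:

```python
from pathlib import Path

import pandas as pd


def convert_csv_dir_to_parquet(data_dir: str, delete_csv: bool = False) -> None:
    """Walk a data directory and convert every CSV file into a Parquet file
    stored alongside it. Hypothetical helper, not the repo's real API."""
    for csv_path in Path(data_dir).rglob("*.csv"):
        parquet_path = csv_path.with_suffix(".parquet")
        df = pd.read_csv(csv_path)
        df.to_parquet(parquet_path, index=False)
        if delete_csv:
            csv_path.unlink()


# Example usage (path is a placeholder):
# convert_csv_dir_to_parquet("./data", delete_csv=False)
```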

Fixed in e8808cc

A quick comparison between CSV and Parquet:

| CSV | Parquet |
| --- | --- |
| Row-based storage format. | A hybrid of row-based and column-based storage formats. |
| Consumes a lot of space, as no default compression is available. For example, a 1 TB file occupies the full 1 TB when stored on Amazon S3 or any other cloud. | Compresses data while storing, thus consuming less space. The same 1 TB of data stored in Parquet format takes up only about 130 GB. |
| Query run time is slow because of the row-based search: for each column, every row of data has to be retrieved. | Query time is about 34 times faster because of the column-based storage and the presence of metadata. |
| More data has to be scanned per query. | About 99% less data is scanned per query, which optimizes performance. |
| Most storage services charge by space used, so the CSV format means higher storage cost. | Lower storage cost, as data is stored in a compressed, encoded format. |
| File schema has to be either inferred (error-prone) or supplied (tedious). | File schema is stored in the metadata. |
| Suitable only for simple data types. | Suitable even for complex types like nested schemas, arrays, and dictionaries. |

Credit: https://geekflare.com/parquet-csv-data-storage/
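A rough, self-contained way to sanity-check the size and read-time difference on your own machine (this is not the benchmark behind the numbers above; it just writes the same synthetic OHLCV-style frame to both formats and compares, assuming pandas, numpy, and pyarrow are available):

```python
import time
from pathlib import Path

import numpy as np
import pandas as pd

# Synthetic 1M-row frame standing in for downloaded historical data.
df = pd.DataFrame({
    "time": pd.date_range("2021-01-01", periods=1_000_000, freq="min"),
    "open": np.random.rand(1_000_000),
    "close": np.random.rand(1_000_000),
    "volume": np.random.randint(0, 10_000, size=1_000_000),
})

df.to_csv("sample.csv", index=False)
df.to_parquet("sample.parquet", index=False)

# Compare on-disk size and full-file read time for the two formats.
for path in (Path("sample.csv"), Path("sample.parquet")):
    start = time.perf_counter()
    pd.read_csv(path) if path.suffix == ".csv" else pd.read_parquet(path)
    elapsed = time.perf_counter() - start
    print(f"{path.name}: {path.stat().st_size / 1e6:.1f} MB, read in {elapsed:.2f}s")
```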