Big data file compression - comparison

Final project for the Parallel and Distributed Programming class. This program compares the compression algorithms available for the Avro, Parquet and ORC file formats.

Dependencies

The program depends on the following Python modules:

  • pyarrow
  • pyorc
  • pandavro
  • sklearn (installed as scikit-learn)
  • pandas

Running the program

Simply execute the main script with a Python 3 interpreter:

python main.py

The program was developed using Python 3.8.