csv2parquet: A Python repository from press0

CSV <=> Parquet transform utility powered by Apache Arrow.

Apache Arrow provides the C++ and Python implementation of Apache Parquet and is used by Apache Spark for data exchange with pandas.

CSV2Parquet.py

Motivation

Cloud data platforms rely on Parquet; data analysts rely on CSV.

Getting Started

Install the application by running the following commands:

git clone https://github.com/press0/csv2parquet.git
cd csv2parquet
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

Test objective

Verify that the CSV <=> Parquet transform is reversible.

Approach:

Transform a CSV file into a Parquet file  
Transform the Parquet file back into a second CSV file  
Load the two CSV files into pandas DataFrames  
Compare the two DataFrames for equality

TestCSV2Parquet.py

python TestCSV2Parquet.py

CSV <=> Parquet transform with AWS Glue

A PySpark script that performs the CSV to Parquet transform on the AWS Glue service.

AwsCvs2ParquetGlue.py