csv_to_parquet_converter

A csv-to-parquet (and vice versa) file converter based on Pandas, written in Python 3


10,000 ft. Overview

This tool converts .csv files to .parquet files, the columnar storage format typically used in the Hadoop ecosystem, and also converts .parquet files back to .csv files. This is achieved with the four built-in Pandas DataFrame methods read_csv, read_parquet, to_csv and to_parquet. Because Pandas uses s3fs for AWS S3 integration, you are free to choose whether the source and/or converted target files live on your local machine or in AWS S3.

How to install and run

  1. set up and activate a virtual environment

  2. pip3 install -r requirements.txt

  3. in case you wish to use AWS S3 as a source and/or target file location for the conversion,
    set the standard AWS environment variables:
    AWS_ACCESS_KEY_ID = <your AWS IAM access key id>
    AWS_SECRET_ACCESS_KEY = <your AWS IAM secret access key value>
    Pandas uses s3fs to integrate with AWS S3; please see https://s3fs.readthedocs.io/en/latest/ in case of any authentication issues.

  4. run python __main__.py with the required arguments: -sfp for the source file path and -tfp for the target file path (quote any path that contains spaces), set like:

    for csv to parquet conversion:

    local csv file to local parquet file:

    -sfp C:\your local folder\source file name.csv
    -tfp C:\your local folder\target file name.parquet

    local csv file to s3 parquet file:

    -sfp C:\your local folder\source file name.csv
    -tfp s3://your bucket name/your bucket "folder" prefix/target file name.parquet

    s3 csv file to local parquet file:

    -sfp s3://your bucket name/your bucket "folder" prefix/source file name.csv
    -tfp C:\your local folder\target file name.parquet

    s3 csv file to s3 parquet file:

    -sfp s3://your bucket name/your bucket "folder" prefix/source file name.csv
    -tfp s3://your bucket name/your bucket "folder" prefix/target file name.parquet

    for parquet to csv conversion:

    local parquet file to local csv file:

    -sfp C:\your local folder\source file name.parquet
    -tfp C:\your local folder\target file name.csv

    local parquet file to s3 csv file:

    -sfp C:\your local folder\source file name.parquet
    -tfp s3://your bucket name/your bucket "folder" prefix/target file name.csv

    s3 parquet file to local csv file:

    -sfp s3://your bucket name/your bucket "folder" prefix/source file name.parquet
    -tfp C:\your local folder\target file name.csv

    s3 parquet file to s3 csv file:

    -sfp s3://your bucket name/your bucket "folder" prefix/source file name.parquet
    -tfp s3://your bucket name/your bucket "folder" prefix/target file name.csv

  5. you can add these optional arguments:
    -cols defines a subset of columns from the source file; only the columns passed as a list to this argument will be loaded and converted, example: ["column_name_1", "column_name_2"]
    -comp overrides the default Parquet compression codec (snappy) when converting from a csv to a parquet file, example: gzip

How to verify a parquet file

Further documentation