csv_to_parquet_converter

A csv-to-parquet (and vice versa) file converter based on Pandas, written in Python 3


10,000 ft. Overview

This tool converts .csv files to .parquet files, the columnar storage format typically used in the Hadoop ecosystem, and also converts .parquet files back to .csv files. This is achieved with the four built-in Pandas DataFrame methods read_csv, read_parquet, to_csv and to_parquet. Because Pandas uses s3fs for AWS S3 integration, you are free to choose whether the source and/or converted target files live on your local machine or in AWS S3.

How to install and run

  1. set up and activate a virtual environment

  2. pip3 install -r requirements.txt

  3. in case you wish to use AWS S3 as a source and/or target file location for the conversion,
    set the standard AWS environment variables:
    AWS_ACCESS_KEY_ID = <your AWS IAM access key id>
    AWS_SECRET_ACCESS_KEY = <your AWS IAM secret access key value>
    Pandas uses s3fs to integrate with AWS S3; please see https://s3fs.readthedocs.io/en/latest/ in case of any authentication issues.

  4. run python __main__.py with the required arguments: -sfp for the source file path and -tfp for the target file path (quote any path that contains spaces), set like:

    for csv to parquet conversion:

    local csv file to local parquet file:

    -sfp C:\your local folder\source file name.csv
    -tfp C:\your local folder\target file name.parquet

    local csv file to s3 parquet file:

    -sfp C:\your local folder\source file name.csv
    -tfp s3://your bucket name/your bucket "folder" prefix/target file name.parquet

    s3 csv file to local parquet file:

    -sfp s3://your bucket name/your bucket "folder" prefix/source file name.csv
    -tfp C:\your local folder\target file name.parquet

    s3 csv file to s3 parquet file:

    -sfp s3://your bucket name/your bucket "folder" prefix/source file name.csv
    -tfp s3://your bucket name/your bucket "folder" prefix/target file name.parquet

    for parquet to csv conversion:

    local parquet file to local csv file:

    -sfp C:\your local folder\source file name.parquet
    -tfp C:\your local folder\target file name.csv

    local parquet file to s3 csv file:

    -sfp C:\your local folder\source file name.parquet
    -tfp s3://your bucket name/your bucket "folder" prefix/target file name.csv

    s3 parquet file to local csv file:

    -sfp s3://your bucket name/your bucket "folder" prefix/source file name.parquet
    -tfp C:\your local folder\target file name.csv

    s3 parquet file to s3 csv file:

    -sfp s3://your bucket name/your bucket "folder" prefix/source file name.parquet
    -tfp s3://your bucket name/your bucket "folder" prefix/target file name.csv

  5. you can add these optional arguments:
    -cols defines a subset of columns from the source file; only the columns passed as a list to this argument will be loaded and converted, example: ["column_name_1", "column_name_2"]
    -comp overrides the default Parquet compression codec (snappy) when converting from a csv to a parquet file, example: gzip

How to verify a parquet file

Further documentation