Converts CSV files to Parquet and vice versa using the Pandas functions `read_csv`, `read_parquet`, `to_csv`, and `to_parquet`. Because Pandas uses s3fs for AWS S3 integration, you are free to choose whether the source and/or converted target files live on your local machine or in AWS S3.
- set up and activate a virtual environment
- `pip3 install -r requirements.txt`
- in case you wish to use AWS S3 as a source and/or target file location for the conversion, set the environment variables:

  ```
  AWS_ACCESS_KEY_ID=<your AWS IAM access key id>
  AWS_SECRET_ACCESS_KEY=<your AWS IAM secret access key value>
  ```

  Pandas uses s3fs to integrate with AWS S3; please see https://s3fs.readthedocs.io/en/latest/ in case of any authentication issues.
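s3fs (via botocore) picks these environment variables up automatically. A quick, tool-agnostic way to check that the credentials are actually visible to the Python process before running a conversion:

```python
import os

# botocore looks for these exact (upper-case) variable names.
for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
    status = "set" if os.environ.get(var) else "NOT set"
    print(f"{var}: {status}")
```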
- run `python __main__.py` with the required arguments `-sfp` (source file path) and `-tfp` (target file path), set like:

  CSV to Parquet:

  ```
  -sfp C:\your local folder\source file name.csv -tfp C:\your local folder\target file name.parquet
  -sfp C:\your local folder\source file name.csv -tfp s3://your bucket name/your bucket "folder" prefix/target file name.parquet
  -sfp s3://your bucket name/your bucket "folder" prefix/source file name.csv -tfp C:\your local folder\target file name.parquet
  -sfp s3://your bucket name/your bucket "folder" prefix/source file name.csv -tfp s3://your bucket name/your bucket "folder" prefix/target file name.parquet
  ```

  Parquet to CSV:

  ```
  -sfp C:\your local folder\source file name.parquet -tfp C:\your local folder\target file name.csv
  -sfp C:\your local folder\source file name.parquet -tfp s3://your bucket name/your bucket "folder" prefix/target file name.csv
  -sfp s3://your bucket name/your bucket "folder" prefix/source file name.parquet -tfp C:\your local folder\target file name.csv
  -sfp s3://your bucket name/your bucket "folder" prefix/source file name.parquet -tfp s3://your bucket name/your bucket "folder" prefix/target file name.csv
  ```
- you can add these optional arguments:
  - `-cols` defines a subset of columns from the source file, meaning that only the columns passed as a list to this argument will get loaded and converted, example: `["column_name_1", "column_name_2"]`
  - `-comp` overrides the default Parquet compression type (`snappy`) in case of converting from a CSV to a Parquet file, example: `gzip`
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html
- https://s3fs.readthedocs.io/en/latest/