Copy from/to Parquet files in PostgreSQL!
pg_parquet
is a PostgreSQL extension that allows you to read and write Parquet files, which are located in S3
or file system
, from PostgreSQL via COPY TO/FROM
commands. It depends on Apache Arrow project to read and write Parquet files and pgrx project to extend PostgreSQL's COPY
command.
-- Copy a query result into Parquet in S3
COPY (SELECT * FROM table) TO 's3://mybucket/data.parquet' WITH (format 'parquet');
-- Load data from Parquet in S3
COPY table FROM 's3://mybucket/data.parquet' WITH (format 'parquet');
- Installation From Source
- Usage
- Object Store Support
- Copy Options
- Configuration
- Supported Types
- Postgres Support Matrix
After installing Postgres
, you need to set up rustup
, cargo-pgrx
to build the extension.
# install rustup
> curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# install cargo-pgrx
> cargo install cargo-pgrx
# configure pgrx
> cargo pgrx init --pg17 $(which pg_config)
# append the extension to shared_preload_libraries in ~/.pgrx/data-17/postgresql.conf
> echo "shared_preload_libraries = 'pg_parquet'" >> ~/.pgrx/data-17/postgresql.conf
# run cargo-pgrx to build and install the extension
> cargo pgrx run
# create the extension in the database
psql> "CREATE EXTENSION pg_parquet;"
There are mainly 3 things that you can do with pg_parquet
:
- You can export Postgres tables/queries to Parquet files,
- You can ingest data from Parquet files to Postgres tables,
- You can inspect the schema and metadata of Parquet files.
You can use PostgreSQL's COPY
command to read and write Parquet files. Below is an example of how to write a PostgreSQL table, with complex types, into a Parquet file and then to read the Parquet file content back into the same table.
-- create composite types
CREATE TYPE product_item AS (id INT, name TEXT, price float4);
CREATE TYPE product AS (id INT, name TEXT, items product_item[]);
-- create a table with complex types
CREATE TABLE product_example (
id int,
product product,
products product[],
created_at TIMESTAMP,
updated_at TIMESTAMPTZ
);
-- insert some rows into the table
insert into product_example values (
1,
ROW(1, 'product 1', ARRAY[ROW(1, 'item 1', 1.0), ROW(2, 'item 2', 2.0), NULL]::product_item[])::product,
ARRAY[ROW(1, NULL, NULL)::product, NULL],
now(),
'2022-05-01 12:00:00-04'
);
-- copy the table to a parquet file
COPY product_example TO '/tmp/product_example.parquet' (format 'parquet', compression 'gzip');
-- show table
SELECT * FROM product_example;
-- copy the parquet file to the table
COPY product_example FROM '/tmp/product_example.parquet';
-- show table
SELECT * FROM product_example;
You can call SELECT * FROM parquet.schema(<uri>)
to discover the schema of the Parquet file at given uri.
You can call SELECT * FROM parquet.metadata(<uri>)
to discover the detailed metadata of the Parquet file, such as column statistics, at given uri.
You can call SELECT * FROM parquet.file_metadata(<uri>)
to discover file level metadata of the Parquet file, such as format version, at given uri.
You can call SELECT * FROM parquet.kv_metadata(<uri>)
to query custom key-value metadata of the Parquet file at given uri.
pg_parquet
supports reading and writing Parquet files from/to S3
object store. Only the uris with s3://
scheme is supported.
The simplest way to configure object storage is by creating the standard ~/.aws/credentials
and ~/.aws/config
files:
$ cat ~/.aws/credentials
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
$ cat ~/.aws/config
[default]
region = eu-central-1
Alternatively, you can use the following environment variables when starting postgres to configure the S3 client:
AWS_ACCESS_KEY_ID
: the access key ID of the AWS accountAWS_SECRET_ACCESS_KEY
: the secret access key of the AWS accountAWS_REGION
: the default region of the AWS accountAWS_SHARED_CREDENTIALS_FILE
: an alternative location for the credentials fileAWS_CONFIG_FILE
: an alternative location for the config fileAWS_PROFILE
: the name of the profile from the credentials and config file (default profile name isdefault
)
Note
To be able to write into a object store location, you need to grant parquet_object_store_write
role to your current postgres user.
Similarly, to read from an object store location, you need to grant parquet_object_store_read
role to your current postgres user.
pg_parquet
supports the following options in the COPY TO
command:
format parquet
: you need to specify this option to read or write Parquet files which does not end with.parquet[.<compression>]
extension. (This is the only option thatCOPY FROM
command supports.),row_group_size <int>
: the number of rows in each row group while writing Parquet files. The default row group size is122880
,row_group_size_bytes <int>
: the total byte size of rows in each row group while writing Parquet files. The default row group size bytes isrow_group_size * 1024
,compression <string>
: the compression format to use while writing Parquet files. The supported compression formats areuncompressed
,snappy
,gzip
,brotli
,lz4
,lz4raw
andzstd
. The default compression format issnappy
. If not specified, the compression format is determined by the file extension.compression_level <int>
: the compression level to use while writing Parquet files. The supported compression levels are only supported forgzip
,zstd
andbrotli
compression formats. The default compression level is6
forgzip (0-10)
,1
forzstd (1-22)
and1
forbrotli (0-11)
.
There is currently only one GUC parameter to enable/disable the pg_parquet
:
pg_parquet.enable_copy_hooks
: you can set this parameter toon
oroff
to enable or disable thepg_parquet
extension. The default value ison
.
pg_parquet
has rich type support, including PostgreSQL's primitive, array, and composite types. Below is the table of the supported types in PostgreSQL and their corresponding Parquet types.
PostgreSQL Type | Parquet Physical Type | Logical Type |
---|---|---|
bool |
BOOLEAN | |
smallint |
INT16 | |
integer |
INT32 | |
bigint |
INT64 | |
real |
FLOAT | |
oid |
INT32 | |
double |
DOUBLE | |
numeric (1) |
FIXED_LEN_BYTE_ARRAY(16) | DECIMAL(128) |
text |
BYTE_ARRAY | STRING |
json |
BYTE_ARRAY | STRING |
bytea |
BYTE_ARRAY | |
date (2) |
INT32 | DATE |
timestamp |
INT64 | TIMESTAMP_MICROS |
timestamptz (3) |
INT64 | TIMESTAMP_MICROS |
time |
INT64 | TIME_MICROS |
timetz (3) |
INT64 | TIME_MICROS |
geometry (4) |
BYTE_ARRAY |
PostgreSQL Type | Parquet Physical Type | Logical Type |
---|---|---|
composite |
GROUP | STRUCT |
array |
element's physical type | LIST |
crunchy_map (5) |
GROUP | MAP |
Warning
- (1) The
numeric
types with <=38
precision is represented asFIXED_LEN_BYTE_ARRAY(16)
withDECIMAL(128)
logical type. Thenumeric
types with >38
precision is represented asBYTE_ARRAY
withSTRING
logical type. - (2) The
date
type is represented according toUnix epoch
when writing to Parquet files. It is converted back according toPostgreSQL epoch
when reading from Parquet files. - (3) The
timestamptz
andtimetz
types are adjusted toUTC
when writing to Parquet files. They are converted back withUTC
timezone when reading from Parquet files. - (4) The
geometry
type is represented asBYTE_ARRAY
encoded asWKB
whenpostgis
extension is created. Otherwise, it is represented asBYTE_ARRAY
withSTRING
logical type. - (5)
crunchy_map
is dependent on functionality provided by Crunchy Bridge. Thecrunchy_map
type is represented asGROUP
withMAP
logical type whencrunchy_map
extension is created. Otherwise, it is represented asBYTE_ARRAY
withSTRING
logical type.
Warning
Any type that does not have a corresponding Parquet type will be represented, as a fallback mechanism, as BYTE_ARRAY
with STRING
logical type. e.g. enum
pg_parquet
is tested with the following PostgreSQL versions:
PostgreSQL Major Version | Supported |
---|---|
17 | ✅ |
16 | ✅ |