Primary language: C++. License: MIT.

This project is archived; see https://github.com/0x0L/pgeon instead.

pg2arrow

pg2arrow is a lightweight tool to parse PostgreSQL binary data into Apache Arrow.

This project is similar to, and heavily inspired by, https://github.com/heterodb/pg2arrow. The main differences are that it retrieves data with COPY instead of FETCH and that it is built on the Arrow C++ API.
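
As a rough illustration of what parsing COPY output involves, the sketch below walks a `COPY ... TO STDOUT (FORMAT binary)` stream using only the standard library. The layout it assumes (an 11-byte signature, a 32-bit flags field, a 32-bit header extension length, then tuples made of a 16-bit field count and 32-bit length-prefixed fields, with -1 marking NULL and a field count of -1 as trailer) follows the PostgreSQL documentation. The names `read_be`, `Field`, `Row` and `parse_copy_binary` are hypothetical and are not taken from pg2arrow.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// A field is either NULL (nullopt) or its raw, still big-endian, payload.
using Field = std::optional<std::string>;
using Row = std::vector<Field>;

// Read a big-endian integer of type T at `pos` and advance the cursor.
template <typename T>
T read_be(const std::string& buf, std::size_t& pos) {
    if (pos > buf.size() || buf.size() - pos < sizeof(T))
        throw std::runtime_error("truncated COPY stream");
    using U = std::make_unsigned_t<T>;
    U u = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i)
        u = static_cast<U>((u << 8) | static_cast<unsigned char>(buf[pos + i]));
    pos += sizeof(T);
    T out;
    std::memcpy(&out, &u, sizeof(T));  // reinterpret the bits as signed
    return out;
}

// Hypothetical helper: split a complete binary COPY stream into rows.
std::vector<Row> parse_copy_binary(const std::string& buf) {
    static const char kSignature[] = "PGCOPY\n\377\r\n";  // 11 bytes with the final '\0'
    constexpr std::size_t kSigLen = sizeof(kSignature);
    if (buf.size() < kSigLen || std::memcmp(buf.data(), kSignature, kSigLen) != 0)
        throw std::runtime_error("bad COPY signature");

    std::size_t pos = kSigLen;
    read_be<std::int32_t>(buf, pos);                   // flags, ignored in this sketch
    const auto ext = read_be<std::int32_t>(buf, pos);  // header extension length
    if (ext < 0 || buf.size() - pos < static_cast<std::size_t>(ext))
        throw std::runtime_error("bad header extension");
    pos += static_cast<std::size_t>(ext);

    std::vector<Row> rows;
    for (;;) {
        const auto nfields = read_be<std::int16_t>(buf, pos);
        if (nfields == -1) break;  // trailer
        if (nfields < 0) throw std::runtime_error("bad field count");
        Row row;
        for (int i = 0; i < nfields; ++i) {
            const auto len = read_be<std::int32_t>(buf, pos);
            if (len == -1) {  // NULL: no payload bytes follow
                row.push_back(std::nullopt);
                continue;
            }
            if (len < 0 || buf.size() - pos < static_cast<std::size_t>(len))
                throw std::runtime_error("bad field length");
            row.emplace_back(buf.substr(pos, static_cast<std::size_t>(len)));
            pos += static_cast<std::size_t>(len);
        }
        rows.push_back(std::move(row));
    }
    return rows;
}
```

A real converter would decode each payload according to the column type and append it to an Arrow builder rather than keep raw strings; this sketch only shows the framing.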

Usage

usage: pg2parquet -d conninfo -q query -o output_file

For instance:

pg2parquet -d postgresql://localhost/mytests -q "select * from minute_bars" -o test.parquet

TODO

  • The general design needs improvement

  • Support for numeric, hstore and a few other more esoteric types is missing

  • Python bindings

  • Error handling

  • Some tests and benchmarks would be nice

  • Replace DecoderMap with a container better suited to our use case: int64_t keys (really) and a very small number of elements?
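
One possible direction for the last item, offered only as a sketch: a flat map that keeps its entries in a single key-sorted vector, which tends to suit a handful of integer keys because lookups stay within contiguous memory. The definition of DecoderMap is not shown here, so `SmallMap` and its interface are hypothetical and not a drop-in replacement.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical flat map for a small number of int64_t keys.
// Entries are stored contiguously and kept sorted by key.
template <typename V>
class SmallMap {
public:
    // Insert a new key, or overwrite the value of an existing one.
    void insert(std::int64_t key, V value) {
        auto it = std::lower_bound(items_.begin(), items_.end(), key, key_less);
        if (it != items_.end() && it->first == key) {
            it->second = std::move(value);
        } else {
            items_.emplace(it, key, std::move(value));
        }
    }

    // Return a pointer to the stored value, or nullptr when the key is absent.
    const V* find(std::int64_t key) const {
        auto it = std::lower_bound(items_.begin(), items_.end(), key, key_less);
        return (it != items_.end() && it->first == key) ? &it->second : nullptr;
    }

    std::size_t size() const { return items_.size(); }

private:
    using Entry = std::pair<std::int64_t, V>;

    // Comparator for lower_bound: orders an entry against a bare key.
    static bool key_less(const Entry& e, std::int64_t k) { return e.first < k; }

    std::vector<Entry> items_;
};
```

Whether this beats a hash map for the actual workload would have to be measured; with very few elements a plain linear scan over the same vector may do just as well.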

Type mapping

Most common base types are supported. The following type map is used:

PostgreSQL     Apache Arrow
-----------    ---------------------------------
bool           bool_()
bpchar         utf8()
bytea          binary()
date           date32()
float4         float32()
float8         float64()
int2           int16()
int4           int32()
int8           int64()
interval       duration(TimeUnit::MICRO)
json           utf8()
jsonb          binary()
serial2        int16()
serial4        int32()
serial8        int64()
text           utf8()
time           time64(TimeUnit::MICRO)
timestamp      timestamp(TimeUnit::MICRO)
timestamptz    timestamp(TimeUnit::MICRO, "utc")
uuid           fixed_size_binary(16)
varchar        utf8()
xml            utf8()
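
Some of these mappings need more than a byte swap. PostgreSQL's binary format counts dates and timestamps from 2000-01-01, whereas Arrow's date32 and timestamp types count from 1970-01-01, so a fixed offset has to be added. The sketch below shows that adjustment under the assumption of integer datetimes (microsecond timestamps); the function names are hypothetical, not pg2arrow's.

```cpp
#include <cstdint>
#include <cstring>

// Distance between 1970-01-01 (Arrow) and 2000-01-01 (PostgreSQL).
constexpr std::int32_t kEpochDays = 10957;
constexpr std::int64_t kEpochMicros = 946684800LL * 1000000LL;

// Load a big-endian 32-bit signed integer from a raw buffer.
inline std::int32_t load_be32(const unsigned char* p) {
    const std::uint32_t u = (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
                            (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
    std::int32_t out;
    std::memcpy(&out, &u, sizeof out);
    return out;
}

// Load a big-endian 64-bit signed integer from a raw buffer.
inline std::int64_t load_be64(const unsigned char* p) {
    std::uint64_t u = 0;
    for (int i = 0; i < 8; ++i) u = (u << 8) | p[i];
    std::int64_t out;
    std::memcpy(&out, &u, sizeof out);
    return out;
}

// PostgreSQL date (days since 2000-01-01) -> Arrow date32 (days since 1970-01-01).
// The special values 'infinity' and '-infinity' are not handled in this sketch.
inline std::int32_t pg_date_to_date32(const unsigned char* p) {
    return load_be32(p) + kEpochDays;
}

// PostgreSQL timestamp (microseconds since 2000-01-01) -> Arrow timestamp(MICRO).
// Again, infinite timestamps would overflow and need separate treatment.
inline std::int64_t pg_timestamp_to_micros(const unsigned char* p) {
    return load_be64(p) + kEpochMicros;
}
```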

SQL composite types are mapped to Arrow struct_(...).

SQL arrays are mapped to Arrow list_(...). Only 1D arrays are fully supported; higher-dimensional arrays are flattened.
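
To make the flattening concrete, here is a minimal sketch that decodes a binary int4 array value into a single row-major sequence, discarding the dimension structure. It assumes the wire layout PostgreSQL uses for arrays (dimension count, a has-NULLs flag, the element type OID, one extent and lower bound per dimension, then length-prefixed elements with -1 for NULL). `Cursor` and `flatten_int4_array` are hypothetical names, and this is not pg2arrow's decoder.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <stdexcept>
#include <vector>

// Bounds-checked reader over one binary array value.
struct Cursor {
    const unsigned char* data;
    std::size_t size;
    std::size_t pos = 0;

    // Read a big-endian 32-bit signed integer and advance.
    std::int32_t i32() {
        if (size - pos < 4) throw std::runtime_error("truncated array value");
        const std::uint32_t u = (std::uint32_t(data[pos]) << 24) |
                                (std::uint32_t(data[pos + 1]) << 16) |
                                (std::uint32_t(data[pos + 2]) << 8) |
                                std::uint32_t(data[pos + 3]);
        pos += 4;
        std::int32_t out;
        std::memcpy(&out, &u, sizeof out);
        return out;
    }
};

// Decode an int4[] value of any dimensionality into one flat sequence.
std::vector<std::optional<std::int32_t>> flatten_int4_array(const unsigned char* data,
                                                            std::size_t size) {
    Cursor c{data, size};
    const std::int32_t ndim = c.i32();
    c.i32();  // flags: non-zero when the array contains NULLs
    c.i32();  // element type OID; not checked in this sketch
    if (ndim < 0) throw std::runtime_error("bad dimension count");

    // Total element count is the product of the extents (0 for an empty array).
    std::int64_t total = (ndim == 0) ? 0 : 1;
    for (std::int32_t d = 0; d < ndim; ++d) {
        const std::int32_t extent = c.i32();
        c.i32();  // lower bound; irrelevant once flattened
        if (extent < 0) throw std::runtime_error("bad extent");
        total *= extent;
        // Every element needs at least a length word, so this bounds `total`.
        if (total > static_cast<std::int64_t>(size))
            throw std::runtime_error("element count exceeds payload");
    }

    std::vector<std::optional<std::int32_t>> out;
    out.reserve(static_cast<std::size_t>(total));
    for (std::int64_t i = 0; i < total; ++i) {
        const std::int32_t len = c.i32();
        if (len == -1) {
            out.push_back(std::nullopt);  // NULL element, no payload
        } else if (len == 4) {
            out.push_back(c.i32());
        } else {
            throw std::runtime_error("unexpected element length for int4");
        }
    }
    return out;
}
```

With this approach a 2x2 array such as `{{1,2},{3,NULL}}` comes out as the four-element sequence 1, 2, 3, NULL, so the original shape cannot be recovered from the Arrow list alone.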