A simple data extraction script that can be deployed as a container. It uses Python with Polars, ConnectorX and PyArrow. The first two are implemented in Rust, and the third in C++. These libraries do the heavy lifting, while Python binds it all together.
This can be a lot less resource intensive than running PySpark (which requires a JVM), and a lot faster than using Pandas (implemented largely in Python).
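As a rough sketch of why this stack is lightweight (illustrative only; the connection string, query, and output name below are placeholders, not necessarily what main.py does): ConnectorX runs the query and materializes the result directly as a Polars DataFrame, skipping Pandas entirely, and Polars writes it straight to Parquet.

```python
# Minimal sketch of the core extraction path; names are illustrative.
import connectorx as cx

# ConnectorX (Rust) executes the query and builds the frame;
# return_type="polars" avoids a Pandas intermediate.
df = cx.read_sql(
    "postgresql://user:password@localhost:5432/mydb",
    "select * from public.table_name",
    return_type="polars",
)

# Polars (Rust) handles the write; PyArrow (C++) backs the Parquet format.
df.write_parquet("output.parquet")
```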
NAME | DESCRIPTION | DEFAULT |
---|---|---|
DATABASE_URL | Database connection URL, including credentials | |
TABLE | Source table name | |
SCHEMA | Source schema name | public |
WRITE_PATH | Destination path or fsspec URL | |
QUERY_OVERWRITE | Query string that overrides the default | select * from schema.table |
WRITE_PARTITIONED | If set, partitions the output by this column (Hive-style) | |
READ_PARTITIONED | If set, reads the data in parallel partitions | |
PARTITION_NUMBER | Number of read partitions when READ_PARTITIONED is set | 4 |
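A hedged sketch of how these variables could be wired together. It assumes READ_PARTITIONED holds the column ConnectorX should split the query on (its `partition_on` argument, which expects a numeric column); the real main.py may differ.

```python
# Hypothetical wiring of the environment variables above.
import os

import connectorx as cx
import pyarrow.parquet as pq

schema = os.environ.get("SCHEMA", "public")
table = os.environ["TABLE"]
# QUERY_OVERWRITE replaces the default "select * from schema.table".
query = os.environ.get("QUERY_OVERWRITE", f"select * from {schema}.{table}")

read_kwargs = {}
if os.environ.get("READ_PARTITIONED"):
    # Assumption: READ_PARTITIONED names the column to partition the
    # query on; ConnectorX fetches the ranges in parallel.
    read_kwargs["partition_on"] = os.environ["READ_PARTITIONED"]
    read_kwargs["partition_num"] = int(os.environ.get("PARTITION_NUMBER", "4"))

df = cx.read_sql(
    os.environ["DATABASE_URL"],
    query,
    return_type="polars",
    **read_kwargs,
)

write_path = os.environ["WRITE_PATH"]
partition_col = os.environ.get("WRITE_PARTITIONED")
if partition_col:
    # Hive-style layout: WRITE_PATH/partition_col=value/*.parquet
    pq.write_to_dataset(
        df.to_arrow(), root_path=write_path, partition_cols=[partition_col]
    )
else:
    df.write_parquet(write_path)
```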
```sh
DATABASE_URL=postgresql://user:password@localhost:5432/cast_concursos TABLE=table_name WRITE_PATH="." WRITE_PARTITIONED="partition_column" python main.py
```