# databundle

databundle facilitates the aggregation of tabular data from a variety of sources (flat files, PostgreSQL databases). It generates a single file containing the combined data from all sources, stored in an efficient format. For now, both HDF5 and Parquet can be used for serialization, allowing efficient storage, fast data loading, and interoperability across many programming languages.
- We want to extract data from local databases and easily transport it to another compute server.
- We want to automatically fetch data from various online resources (e.g. public SQL databases) into a single file.
To install:

```bash
git clone git@github.com:legaultmarc/databundle.git
pushd databundle
pip install -e .
```
Bundling is relatively simple. First, you define the data sources and serialization backend:
```yaml
serializer:
  output_filename: my_data_cache
  backend_name: parquet
  backend_parameters:
    engine: pyarrow

sources:
  # Data from a local SQL database
  - name: hospitalization_data
    type: postgresql
    source_parameters:
      sql: >
        select distinct sample_id::TEXT as sample_id, diagnosis_code
        from hospitalization_data
        union
        select distinct eid::TEXT, code
        from secondary_hospitalization_data
      dbname: some_psql_database
      host: hostname.domain
      user: database_username

  # Data from a delimited file
  - name: continuous_variables
    type: flat_file
    source_parameters:
      filename: test_data/dense.csv
      delimiter: ','
```
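Since HDF5 is also supported as a serialization format, the `serializer` block can presumably point at that backend instead; the `backend_name` value in the sketch below is an assumption and may differ from the actual option name.

```yaml
# Hypothetical HDF5 variant of the serializer block. ASSUMPTION: the
# backend_name value ("hdf5") is a guess; check the actual option name.
serializer:
  output_filename: my_data_cache
  backend_name: hdf5
```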
The `databundle my_config.yaml` command can then be used to generate the bundle:
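```bash
databundle my_config.yaml
```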
Deserialization can be done manually or by recreating the Serde instance:
```python
import databundle.core

serde = databundle.core.ParquetSerde("my_data_cache.tar")
data = serde.deserialize()
```
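For the manual route, a minimal sketch is shown below. It assumes the bundle is a plain tar archive holding one Parquet file per configured source; the archive layout and file naming are assumptions, not documented behavior.

```python
import tarfile

import pandas as pd

# Minimal sketch of manual deserialization. ASSUMPTION: the bundle is a
# tar archive containing one parquet file per configured source.
data = {}
with tarfile.open("my_data_cache.tar") as archive:
    for member in archive.getmembers():
        if not member.isfile():
            continue
        source_name = member.name.rsplit(".", 1)[0]  # drop the file extension
        with archive.extractfile(member) as f:
            data[source_name] = pd.read_parquet(f, engine="pyarrow")
```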
## Available now

- Supported sources:
  - Flat files (delimited files)
  - PostgreSQL databases
- Handling of missing values (done automatically by Pandas)
- Easy configuration with YAML and a command-line interface
## Wishlist

- Supported sources:
  - MySQL
  - SQLite
  - HTTP / JSON
- Backends:
  - Feather