# ratschlab-common

Small library of common code used in various projects in the ratschlab.

## Features

- Writing parquet and HDF5 files with sensible defaults
  (`ratschlab_common.io.dataframe_formats`)
- Support for working with 'chunkfiles', i.e. splitting up a large dataset
  into smaller chunks which can be processed independently (see example
  notebook):
  - Repartitioning records (i.e. increasing or decreasing the number of
    chunkfiles) while keeping data which belongs together in the same file
    (e.g. data associated with the same patient id)
  - Simple indexing for looking up in which chunk the data belonging to,
    e.g., a particular patient can be found
- bigmatrix: support for creating and reading large matrices stored in HDF5,
  with additional metadata on the axes in the form of dataframes (see example
  notebook)
- Small wrappers for Spark and Dask (see Spark example)
- Saving sparse pandas dataframes to HDF5 (see example notebook)
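
To illustrate the chunkfile idea, here is a minimal, generic sketch in plain pandas. It is not the actual `ratschlab_common` API; the function names `chunk_id` and `split_into_chunks` are made up for this example. Records are assigned to chunks by hashing a key column, so all rows sharing e.g. a patient id land in the same chunk, and the same hash later tells you in which chunk to look a patient up:

```python
import zlib

import pandas as pd


def chunk_id(key, n_chunks: int) -> int:
    """Stable chunk assignment: hash the key, take it modulo the chunk count."""
    return zlib.crc32(str(key).encode()) % n_chunks


def split_into_chunks(df: pd.DataFrame, key: str, n_chunks: int) -> dict:
    """Partition a dataframe into chunks, keeping rows with the same key together."""
    assignment = df[key].map(lambda k: chunk_id(k, n_chunks))
    return {i: grp for i, grp in df.groupby(assignment)}


df = pd.DataFrame({"patient_id": [1, 1, 2, 3, 3, 3], "value": range(6)})
chunks = split_into_chunks(df, "patient_id", n_chunks=2)

# "Simple indexing": to find patient 3 again, recompute chunk_id(3, 2)
# instead of scanning every chunk.
```

In practice each chunk would be written to its own file (e.g. parquet), and repartitioning just means re-running the assignment with a different `n_chunks`.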

## Tools

`ratschlab-common` also comes with some command line tools:
- `pq-tool`: inspect parquet files on the command line
  - `pq-tool head`: first records
  - `pq-tool tail`: last records
  - `pq-tool cat`: all records
  - `pq-tool schema`: schema of a parquet file
- `export-db-to-files`: tool to dump (postgres) database tables into parquet
  files. Large tables can be partitioned on a key and dumped into separate
  file chunks. This allows for further processing to be easily done in
  parallel.
- `bigmatrix-repack`: rechunking/repacking of bigmatrix HDF5 files

## Installation and Requirements

The library along with all the required dependencies can be installed with:

```sh
pip install ratschlab-common[complete]
```
Depending on whether you plan to use Spark, Dask, or neither of them, you can
install `ratschlab-common` through one of the following commands:

```sh
pip install ratschlab-common
pip install ratschlab-common[spark]
pip install ratschlab-common[dask]
```
Note that if you plan on using Spark, make sure you have Java 8 and either
Python 3.6 or 3.7 installed (Python 3.8 is currently not supported by
`pyspark`).