A mechanism for storing fresh/hot data in the NoSQL (key-value) database and historical data in Parquet files, while providing users with a single access point (a view) for easier access to both real-time and historical data.
The view is created in Presto on top of Hive & V3IO KV. Once the user creates the view, an automated job is scheduled to run at the given interval:
- The job creates the view
- The job deletes the old KV partitions and the old Parquet files
- The job runs on the app nodes
- The job is based on crontab
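To make the mechanism concrete, here is a rough sketch of the kind of unified view such a job could create in Presto, issued through the pyhive Presto client. The connection details, catalog/schema layout, and table names below are illustrative assumptions, not the exact statements parquez generates:

```python
from pyhive import presto

# Connect to the Presto coordinator (host, port, and user are placeholders).
conn = presto.connect(host="presto-coordinator", port=8080, username="iguazio")
cursor = conn.cursor()

# A single view exposing both the historical Parquet data (Hive connector)
# and the hot data in the V3IO KV table (v3io connector).
# The view and table names here are hypothetical.
cursor.execute("""
CREATE OR REPLACE VIEW hive.default.my_parquez_view AS
SELECT * FROM hive.default.my_parquez_hist   -- historical Parquet partitions
UNION ALL
SELECT * FROM v3io.bigdata.my_parquez_rt     -- real-time KV table (hot data)
""")
cursor.fetchall()  # consume the (empty) result so the statement completes
```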
Users can create a view for the “parquez” table using a script / REST call (an illustrative sketch follows the parameter list below).
- view-name : The unified view name (Parquet and KV)
- partition-by [h / d / m / y] : Only time-based partitioning is supported in this phase
- partition-interval [1-24h / 1-31d / 1-12m / 1-Ny] : Partition creation interval
- real-time-table-name : The KV table for the view (the full path must be specified)
- real-time-window [h / d / m / y] : The time window for storing data in the key-value store (hot data)
- historical-retention [h / d / m / y] : The retention period for all parquez data
- config : Config file path
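The exact script/REST interface is not documented in this section; purely as an illustration of how the parameters above fit together, a hypothetical HTTP call might look like the following (the endpoint, port, and all values are made up):

```python
import requests

# Hypothetical endpoint and values, shown only to illustrate the parameters
# described above; the real parquez script/REST interface may differ.
payload = {
    "view-name": "my_parquez_view",
    "partition-by": "h",                          # partition by hour
    "partition-interval": "1h",                   # create a new partition every hour
    "real-time-table-name": "/bigdata/my_table",  # full path of the KV table
    "real-time-window": "1d",                     # keep one day of hot data in KV
    "historical-retention": "12m",                # keep twelve months of data overall
    "config": "/path/to/parquez.cfg",
}

resp = requests.post("http://parquez-host:8080/views", json=payload)
resp.raise_for_status()
```

An example of the config file referenced by the config parameter: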
[v3io]
v3io_container = bigdata
[hive]
hive_schema = default
[presto]
uri = <presto_uri>
v3io_connector = v3io
hive_connector = hive
[nginx]
v3io_api_endpoint_host = <v3io_api_endpoint_host>
v3io_api_endpoint_port = 443
[compression]
type = Parquet
coalesce = 6
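For reference, a minimal sketch of reading such a config from Python with the standard-library configparser; the file path below is a placeholder:

```python
import configparser

# Load the parquez config file (path is a placeholder).
config = configparser.ConfigParser()
config.read("/path/to/parquez.cfg")

v3io_container = config["v3io"]["v3io_container"]         # e.g. "bigdata"
hive_schema    = config["hive"]["hive_schema"]            # e.g. "default"
presto_uri     = config["presto"]["uri"]
coalesce       = config.getint("compression", "coalesce") # value from the [compression] section
```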
- parquez scripts
- A partitioned KV table
- Enable Hive in the Presto service
- Set hive.allow-drop-table=true
- Set hive.non-managed-table-writes-enabled=true
Clone this repository and cd into it:
mkdir parquez && \
git clone https://github.com/iguazio/parquez.git && \
cd parquez
Run parquez
From Jupyter, run parquez.ipynb