Data Statistics Analyzer

Question

lukehan opened this issue 10 years ago · 0 comments

1. Overview

We need statistics data to support the following tasks:

Design cube metadata based on query log
Design HBase row-key based on data distribution (e.g. histogram and cardinality)
Choose execution plan based on cuboid data

2. Data Analyzer

We need to analyze the Hive data and the cube data in 2 phases. First, we will analyze the Hive data to guide the 1st round of row key design. Then we will analyze the cube data to refine the row key design and to estimate the cost of queries.

2.1. Analyze Hive Data

We need to collect the following statistics on the Hive table:

Cardinality of each dimension
Cardinality of dimension combination (optional)
Value distribution of each dimension (optional)

Based on the statistics of the Hive data, we can design the row key groups by ordering dimensions from high cardinality to low cardinality. We should also split the dimensions evenly across the row key groups, which will reduce the number of cuboids.
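
A minimal sketch of the ordering and splitting described above, in Python. It is an illustration only, not the project's implementation: the function names, the in-memory `rows` representation (a list of dicts) and the "contiguous chunks of near-equal size" split policy are all assumptions, and a real Hive table would need a distributed or approximate distinct count rather than in-memory sets.

```python
def dimension_cardinalities(rows, dimensions):
    """Count distinct values per dimension (exact, in memory)."""
    distinct = {d: set() for d in dimensions}
    for row in rows:
        for d in dimensions:
            distinct[d].add(row[d])
    return {d: len(values) for d, values in distinct.items()}


def order_by_cardinality(cardinalities):
    """Order dimensions from high to low cardinality; ties broken by name."""
    return sorted(cardinalities, key=lambda d: (-cardinalities[d], d))


def split_evenly(ordered_dimensions, group_count):
    """Split the ordered dimensions into contiguous groups of near-equal size.

    Assumption: "evenly" means group sizes differ by at most one.
    """
    if group_count <= 0:
        raise ValueError("group_count must be positive")
    base, extra = divmod(len(ordered_dimensions), group_count)
    groups, start = [], 0
    for i in range(group_count):
        size = base + (1 if i < extra else 0)
        groups.append(ordered_dimensions[start:start + size])
        start += size
    return groups


# Made-up sample rows, for illustration only.
SAMPLE_ROWS = [
    {"user_id": 1, "city": "NYC", "country": "US", "gender": "M"},
    {"user_id": 2, "city": "LA", "country": "US", "gender": "F"},
    {"user_id": 3, "city": "Paris", "country": "FR", "gender": "F"},
    {"user_id": 4, "city": "NYC", "country": "US", "gender": "M"},
]

if __name__ == "__main__":
    dims = ["user_id", "city", "country", "gender"]
    cards = dimension_cardinalities(SAMPLE_ROWS, dims)
    print(split_evenly(order_by_cardinality(cards), 2))
```

Breaking ties by name only keeps the output deterministic; any stable tie-break would do.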

2.2. Analyze Cube Data

We need to collect the following statistics on the cube data:

Count of each cuboid
Group ratio of each cuboid = current cuboid count / lower group base cuboid count
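
The two statistics above could be computed roughly as follows. This Python sketch rests on assumptions not stated in the issue: that the "count" of a cuboid is its number of rows (the number of distinct value combinations over its dimensions), and that the caller already knows which cuboid is the "lower group base cuboid". The function names and the sample rows are hypothetical.

```python
def cuboid_count(rows, cuboid_dimensions):
    """Number of rows in a cuboid: distinct value combinations of its dimensions.

    Assumption: this is what "count of each cuboid" refers to.
    """
    return len({tuple(row[d] for d in cuboid_dimensions) for row in rows})


def group_ratio(current_cuboid_count, lower_group_base_count):
    """Group ratio = current cuboid count / lower group base cuboid count."""
    if lower_group_base_count <= 0:
        raise ValueError("lower group base cuboid count must be positive")
    return current_cuboid_count / lower_group_base_count


# Made-up sample rows, for illustration only.
CUBE_SAMPLE_ROWS = [
    {"country": "US", "gender": "M"},
    {"country": "US", "gender": "F"},
    {"country": "FR", "gender": "F"},
    {"country": "US", "gender": "M"},
]

if __name__ == "__main__":
    current = cuboid_count(CUBE_SAMPLE_ROWS, ["country"])
    base = cuboid_count(CUBE_SAMPLE_ROWS, ["country", "gender"])
    print(group_ratio(current, base))
```

A ratio close to 1 means the cuboid is nearly as large as the base cuboid it is compared against, so it saves little at query time.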

3. Query Analyzer

TBD