Data Statistics Analyzer
lukehan opened this issue · 0 comments
1. Overview
We need statistics data for the following purposes:
- Design cube metadata based on query log
- Design HBase row-key based on data distribution (e.g. histogram and cardinality)
- Choose execution plan based on cuboid data
2. Data Analyzer
We need to analyze the Hive data and the cube data in two phases. First, we analyze the Hive data to guide the first-round design of the row key. Then we analyze the cube data to refine the row key design and to estimate query cost.
2.1. Analyze Hive Data
We need to collect the following statistics on the Hive table:
- Cardinality of each dimension
- Cardinality of dimension combination (optional)
- Value distribution of each dimension (optional)
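As a rough illustration of the three statistics listed above, here is a minimal in-memory sketch. The row format (a list of dicts), the function names, and the sample data are assumptions made for this example; they are not part of any existing analyzer. It counts exactly, whereas a job over a large Hive table would more likely run as a distributed query and might use approximate counting.

```python
from collections import Counter

def dimension_stats(rows, dimensions):
    """Return the exact cardinality and value counts of each dimension."""
    stats = {}
    for dim in dimensions:
        counts = Counter(row[dim] for row in rows)
        stats[dim] = {"cardinality": len(counts), "distribution": counts}
    return stats

def combination_cardinality(rows, dims):
    """Return the number of distinct value tuples for a set of dimensions."""
    return len({tuple(row[d] for d in dims) for row in rows})

# Hypothetical sample rows standing in for a Hive table.
rows = [
    {"country": "US", "category": "books", "seller_id": 1},
    {"country": "US", "category": "toys", "seller_id": 2},
    {"country": "CN", "category": "books", "seller_id": 3},
    {"country": "US", "category": "books", "seller_id": 4},
]

stats = dimension_stats(rows, ["country", "category", "seller_id"])
pair_cardinality = combination_cardinality(rows, ["country", "category"])
```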
Based on the Hive statistics, we can design the row key groups by ordering dimensions from high cardinality to low cardinality. We should also split the dimensions evenly across the row key groups, which reduces the number of cuboids.
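The effect of an even split can be sketched as follows. This assumes a simplified cost model in which a group of n dimensions contributes 2^n cuboids and groups are built independently; the real cuboid count depends on how the cube is actually defined. The function names and cardinality values are hypothetical.

```python
def split_into_groups(cardinalities, num_groups):
    """Order dimensions by descending cardinality, then cut them into
    contiguous groups whose sizes differ by at most one."""
    ordered = sorted(cardinalities, key=cardinalities.get, reverse=True)
    base, extra = divmod(len(ordered), num_groups)
    groups, start = [], 0
    for i in range(num_groups):
        size = base + (1 if i < extra else 0)
        groups.append(ordered[start:start + size])
        start += size
    return groups

def cuboid_count(group_sizes):
    """Simplified model (an assumption): each group of n dimensions
    yields 2**n cuboids."""
    return sum(2 ** n for n in group_sizes)

# Hypothetical cardinalities, e.g. taken from the Hive analysis.
cardinalities = {
    "seller_id": 1_000_000,
    "item_id": 500_000,
    "city": 3_000,
    "category": 200,
    "country": 50,
    "gender": 2,
}

groups = split_into_groups(cardinalities, 2)
even = cuboid_count([len(g) for g in groups])  # sizes 3 + 3
uneven = cuboid_count([5, 1])                  # same dimensions, uneven split
ungrouped = cuboid_count([6])                  # one group with all dimensions
```

Under this model the even split gives 16 cuboids, the 5 + 1 split gives 34, and a single group of six dimensions gives 64.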
2.2. Analyze Cube Data
We need to collect the following statistics on the data cube:
- Count of each cuboid
- Group ratio of each cuboid = current cuboid count / lower group base cuboid count
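The group ratio formula above can be written as a small helper. The issue does not define precisely which cuboid counts as the "lower group base cuboid", so this sketch simply takes that count as an input; the cuboid names and counts are made-up values for illustration.

```python
def group_ratio(cuboid_count, base_cuboid_count):
    """Group ratio = current cuboid count / lower group base cuboid count."""
    if base_cuboid_count <= 0:
        raise ValueError("base cuboid count must be positive")
    return cuboid_count / base_cuboid_count

# Hypothetical row counts per cuboid; "A,B,C" plays the role of the base cuboid.
counts = {"A,B,C": 1000, "A,B": 400, "A": 20}
base = counts["A,B,C"]
ratios = {name: group_ratio(count, base) for name, count in counts.items()}
```

A ratio close to 1 means the cuboid is almost as large as its base cuboid and so saves little by being precomputed, while a small ratio indicates strong aggregation.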
3. Query Analyzer
TBD