hive-hbase-handler: A Java repository from sevenseablue

HBase Storage Handler

Introduction to the official HBase handler: Link for documentation

This project is based on the official HBase handler and adds some custom features. Usage notes and test results follow.

Overview

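The handler does not use the HBase read/write API. It uses only the snapshot and bulkload APIs, so its impact on the HBase system is small. A few principles apply to reads and writes: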

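Use it at low frequency, daily or hourly. It is not suitable for real-time use.
Read: read the data into HDFS first, then do any further dimensional processing there. Never build a real-time read application on it.
Write: move workloads that would otherwise query Hive many times over to HBase, that is, real-time, near-real-time, or many-times-a-day lookups by key. Write the Hive data to HBase, then serve those reads from HBase.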

Read/write cases

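Only the following read/write patterns are allowed:

Read the full table into a Hive table
Read a subset of columns into a Hive table
Read a subset of rows into a Hive table, filtered by rowkey conditions
Write, with rowkeys unique and non-empty, written in sorted order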

1 create table

CREATE EXTERNAL TABLE t1g_hb(col0 string,
                             col1 string,
                             col2 string,
                             col3 string)
STORED BY 'com.wdw.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key, f:col1, f:col2, f:col3')
TBLPROPERTIES("hbase.table.name" = "data:t1g_hb","hbase.table.family.name"="f", "hbase.mapred.output.outputtable" = "data:t1g_hb");

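The table must be an external table. Dropping a non-external table also drops the corresponding table in HBase.
When creating a Hive table with this HBase handler, the first column maps to the HBase rowkey.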

key	val	description
STORED BY	com.wdw.hive.hbase.HBaseStorageHandler
WITH SERDEPROPERTIES
hbase.columns.mapping	:key, f:col1, f:col2	rowkey, columnfamily:columnname
TBLPROPERTIES
hbase.table.name	data:t1g_hb
hbase.table.family.name	f
hbase.mapred.output.outputtable	data:t1g_hb

2 insert, small data volume (typically under 3 GB); the HBase table can have a single region

insert into t1g_hb select * from t1g where col0 is not null and col0 != '' cluster by col0;

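The rowkey column must not be NULL (or the string null) or empty, so filter those rows out: key is not null and key != ''
Rowkeys must be sorted and must not contain duplicates; use cluster by on the rowkey column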
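
The two rules above amount to a simple invariant on the stream of rowkeys: every key is present, and each key is strictly greater than the one before it. A minimal Java sketch of that check follows. `RowKeyCheck` and its methods are hypothetical names that are not part of this repository, and the unsigned byte-wise ordering is an assumption based on how HBase orders rowkeys.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of this repository.
class RowKeyCheck {

    // Compare two keys as unsigned bytes, left to right (assumed HBase rowkey order).
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // True when every key is non-null, non-empty, and strictly greater than the previous key.
    static boolean isValidWriteOrder(List<String> keys) {
        byte[] prev = null;
        for (String key : keys) {
            if (key == null || key.isEmpty()) {
                return false; // the rowkey must not be NULL or empty
            }
            byte[] cur = key.getBytes(StandardCharsets.UTF_8);
            if (prev != null && compareKeys(prev, cur) >= 0) {
                return false; // out of order, or a duplicate rowkey
            }
            prev = cur;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidWriteOrder(Arrays.asList("a1", "a2", "b1"))); // prints true
        System.out.println(isValidWriteOrder(Arrays.asList("a1", "a1")));       // prints false
        System.out.println(isValidWriteOrder(Arrays.asList("a1", "")));         // prints false
    }
}
```

In the insert statement, the where clause covers the first condition and cluster by covers the ordering; uniqueness still depends on the source data.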

3 insert, large data volume (tens of GB, hundreds of GB, several TB, and so on); the HBase table needs multiple regions

insert into t100g_hb select * from t100g cluster by col0;
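
With multiple regions, each row has to end up in the region whose key range contains its rowkey. The Java sketch below shows one way to look that up from the sorted region start keys. It is an illustration only: `RegionLookup` is a hypothetical name, the start keys are made up, and plain `String` comparison is assumed to match rowkey order, which holds for ASCII keys.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of this repository.
class RegionLookup {

    // startKeys must be sorted ascending, and startKeys[0] is "" because the
    // first region has no lower bound. Returns the index i of the region whose
    // range [startKeys[i], startKeys[i + 1]) contains rowKey.
    static int regionIndex(String[] startKeys, String rowKey) {
        int pos = Arrays.binarySearch(startKeys, rowKey);
        if (pos >= 0) {
            return pos; // rowKey equals a start key, so it opens that region
        }
        int insertionPoint = -pos - 1;
        return insertionPoint - 1; // the region that starts just before rowKey
    }

    public static void main(String[] args) {
        String[] startKeys = {"", "g", "n", "t"}; // four illustrative regions
        for (String key : new String[] {"apple", "g", "melon", "zebra"}) {
            System.out.println(key + " -> region " + regionIndex(startKeys, key));
        }
    }
}
```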

4 read: read all, read by key range, read a subset of columns

insert into t100g_hb_91_f select * from t100g_hb_91;
insert into t100g_hb_91_f2 select col0, col1, col10, col20 from t100g_hb_91 where col0>='v1' and col0<'v2';
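
The key-range filter in the second statement is a half-open range compared as strings, so it matches every rowkey that sorts at or after 'v1' and before 'v2'. The small Java sketch below illustrates which keys such a range covers. `KeyRange` is a hypothetical name, and `String.compareTo` is assumed to agree with the comparison Hive applies, which holds for plain ASCII keys.

```java
// Hypothetical helper for illustration; not part of this repository.
class KeyRange {

    // Mirrors the predicate col0 >= start AND col0 < stop for plain ASCII keys.
    static boolean inRange(String key, String start, String stop) {
        return key.compareTo(start) >= 0 && key.compareTo(stop) < 0;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"v0", "v1", "v10", "v1zzz", "v2"}) {
            System.out.println(key + " in [v1, v2): " + inRange(key, "v1", "v2"));
        }
    }
}
```

Keys such as v10 and v1zzz fall inside the range because the comparison is lexicographic, not numeric.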

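Tests

Test cluster: Hadoop with 31 datanodes, HBase with 7 regionservers, 400 vcores

Production cluster: 900 vcores

Functional tests

Write performance tests

The Hive table has 50 columns whose values are random alphanumeric strings of length 1 through 50 respectively, stored as textfile

Generated HBase files: 3.2 GB uncompressed, 1.8 GB with snappy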

test

rows	data size	regions	hbase api (s)	hfile write cp (s)	hfile write mv/hash (s)	hfile write mv/totalorder (s)
1M	1.2 GB	1	200.898	172.024	136.259
10M	12 GB	1	1663.148		166.473
100M	120 GB	1			529.338
100M	120 GB	91			34h+ (estimated; the run did not finish; each split takes 30+ minutes)	582.638
1B	1.2 TB	901

prod

rows	data size	regions	hfile write mv/totalorder (s)
1M	1.2 GB	1	164.981
10M	12 GB	10	223.088
100M	120 GB	91	278
1B	1.2 TB	901	504.685

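Each region is handled by a single reduce process, so data skew slows the whole job down.

The write-speed bottleneck is the amount of data written to the largest region. A possible future extension is to add more keys among the start keys so that the work is split further.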
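
One way to make the skew concrete is to compare the largest region with the average region: since one reduce process writes each region, the job runs roughly as long as the largest region takes. The Java sketch below computes that ratio. `RegionSkew` is a hypothetical name and the sizes are made-up illustrative values, not measurements from these tests.

```java
// Hypothetical helper for illustration; not part of this repository.
class RegionSkew {

    // Ratio of the largest region's size to the mean region size.
    // 1.0 means perfectly balanced; larger values mean more skew.
    static double skewFactor(long[] regionSizes) {
        if (regionSizes.length == 0) {
            throw new IllegalArgumentException("at least one region is required");
        }
        long max = 0;
        long total = 0;
        for (long size : regionSizes) {
            max = Math.max(max, size);
            total += size;
        }
        if (total == 0) {
            return 1.0; // all regions empty, nothing is skewed
        }
        double mean = (double) total / regionSizes.length;
        return max / mean;
    }

    public static void main(String[] args) {
        long[] sizesInGb = {1, 1, 1, 5}; // illustrative sizes only
        System.out.println("skew factor: " + skewFactor(sizesInGb));
    }
}
```

A skew factor well above 1.0 suggests that splitting the largest region, for example with additional start keys, would shorten the job.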

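Read performance tests

The HBase table has 50 columns whose values are random alphanumeric strings of length 1 through 50 respectively

test: 350 vcores, Hive table in textfile format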

rows	data size	regions	hfile read (s)
1M	2.9 GB	1	174.318
10M	29 GB	10
100M	290 GB	91	253.311

prod: 900 vcores, Hive table in ORC format, 1g*region_num

rows	data size	regions	hfile read (s)
1M	2.9 GB	1	203.091
10M	29 GB	10	204.402
100M	290 GB	91	352.919
1B	2.9 TB	901	252.825

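Each HFile is read by a single map task, so the size of the largest HFile is ultimately the main bottleneck.

Conclusion

For both reads and writes, the bottleneck is the largest region.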
