Apache HBase: bulk import of data from CSV files, with basic operation examples

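Apache Hadoop can process petabyte-scale data in a reasonable amount of time. While studying Hadoop I came across a system built for random reads and writes: Apache HBase. It can also be seen as one of the data storage architectures that is popular today. Traditional databases become expensive when applied to big-data problems, whereas HBase builds on the ability of HDFS to store massive amounts of data (so there is no need to deal with cluster expansion or with splitting databases and tables) and offers query operations that resemble those of a traditional RDBMS, based on row-key lookups and scans rather than SQL.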

The Google Bigtable paper provided the theoretical basis for the design of Apache HBase: https://static.googleusercontent.com/media/research.google.com/zh-CN//archive/bigtable-osdi06.pdf


Apache Hadoop installation guide: https://www.jianshu.com/p/4c81a1e32161

Apache HBase installation guide: https://www.guru99.com/hbase-installation-guide.html

1. Local deployment steps:

1) Clone the repository to your local machine: git clone https://github.com/SamMACode/hbase-stranger-example.git



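Warning: pay special attention to the ZooKeeper address. During local development I configured it as a raw IP address, and the client could never connect to the HBase server. The reason is that HBase registers the host's hostname in ZooKeeper, so a configuration based on the raw IP address cannot be handled. As a result, the HBase client keeps retrying the connection and eventually fails with an error (tracking this down took a long time).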
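
One way to catch this early is to check the configured ZooKeeper hosts before connecting. The sketch below is a hypothetical helper, not code from this repository: it assumes the hosts arrive as a comma-separated string (the same shape as HBase's `hbase.zookeeper.quorum` property) and simply reports entries that look like raw IPv4 addresses. The class name, method name, and the sample address `192.168.0.10` are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the repository: flags quorum entries that are raw IPv4 literals.
class ZkQuorumCheck {
    // Matches dotted-quad strings such as 192.168.0.10 (no range check; enough for a warning).
    private static final Pattern IPV4 = Pattern.compile("^\\d{1,3}(\\.\\d{1,3}){3}$");

    /** Returns the entries of a comma-separated quorum string that look like raw IPv4 addresses. */
    static List<String> rawIpEntries(String quorum) {
        List<String> hits = new ArrayList<>();
        for (String entry : quorum.split(",")) {
            String host = entry.trim();
            // Strip an optional ":port" suffix before testing.
            int colon = host.indexOf(':');
            if (colon >= 0) {
                host = host.substring(0, colon);
            }
            if (IPV4.matcher(host).matches()) {
                hits.add(host);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Default value is an assumed example mixing a hostname and a raw IP.
        String quorum = args.length > 0 ? args[0] : "k8s-master1,192.168.0.10:2181";
        for (String ip : rawIpEntries(quorum)) {
            System.out.println("warning: quorum entry " + ip
                    + " is a raw IP; prefer the hostname registered in ZooKeeper");
        }
    }
}
```

The check only warns; whether a hostname actually resolves from the client machine still has to be verified separately (for example through an entry in the local hosts file).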


hbase(main):002:0> create 'fakenames', 'personal', 'contactinfo', 'creditcard'
0 row(s) in 2.4060 seconds
=> Hbase::Table - fakenames


sam@sam-virtual-machine:~/repositories$ java -jar hbase-sample-0.0.1.jar fakenames-sample-1000.csv
2020-04-15 15:22:48.361  INFO 37109 --- [           main] c.p.explore.hbase.HbaseOperateApp        : Starting HbaseOperateApp v0.0.1 on sam-virtual-machine with PID 37109 (/home/sam/repositories/hbase-sample-0.0.1.jar started by sam in /home/sam/repositories)
2020-04-15 15:22:48.367  INFO 37109 --- [           main] c.p.explore.hbase.HbaseOperateApp        : No active profile set, falling back to default profiles: default
2020-04-15 15:22:49.363  INFO 37109 --- [           main] c.p.explore.hbase.conf.HbaseConfig       : connect to apache hbase server, config: [Configuration: core-default.xml, core-site.xml, hbase-default.xml, hbase-site.xml] zkhost: [192.168.xxx.xxx] zkport: [2181] 
2020-04-15 15:22:49.519  INFO 37109 --- [           main] c.p.explore.hbase.HbaseOperateApp        : Started HbaseOperateApp in 1.681 seconds (JVM running for 2.067)
2020-04-15 15:22:49.648  INFO 37109 --- [           main] c.p.explore.hbase.HbaseOperateApp        : command line args: ["fakenames-sample-1000.csv"]
2020-04-15 15:22:50.341 DEBUG 37109 --- [           main] o.a.hadoop.hbase.ipc.AbstractRpcClient   : Codec=org.apache.hadoop.hbase.codec.KeyValueCodec@491cc5c9, compressor=null, tcpKeepAlive=true, tcpNoDelay=true, connectTO=10000, readTO=20000, writeTO=60000, minIdleTimeBeforeClose=120000, maxRetries=0, fallbackAllowed=false, bind address=null
2020-04-15 15:22:50.408  INFO 37109 --- [           main] c.p.e.h.service.BulkImportFakenamesData  : now at line [100] rowKey [elsass-june-g-100]
2020-04-15 15:22:50.458  INFO 37109 --- [           main] c.p.e.h.service.BulkImportFakenamesData  : now at line [200] rowKey [franz-michael-b-200]
2020-04-15 15:22:50.493  INFO 37109 --- [           main] c.p.e.h.service.BulkImportFakenamesData  : now at line [300] rowKey [pua-karl-m-300]
2020-04-15 15:22:50.508  INFO 37109 --- [           main] c.p.e.h.service.BulkImportFakenamesData  : now at line [400] rowKey [truax-robin-a-400]
2020-04-15 15:22:50.528  INFO 37109 --- [           main] c.p.e.h.service.BulkImportFakenamesData  : now at line [500] rowKey [cashin-karla-s-500]
2020-04-15 15:22:50.532  INFO 37109 --- [           main] c.p.e.h.service.BulkImportFakenamesData  : now at line [600] rowKey [woodson-benjamin-n-600]
2020-04-15 15:22:50.618 DEBUG 37109 --- [s-master1:2181)] o.a.h.h.s.o.apache.zookeeper.ClientCnxn  : Reading reply sessionid:0x100cf6bbe02003b, packet:: clientPath:null serverPath:null finished:false header:: 3,8  replyHeader:: 3,12884902281,0  request:: '/hbase1,F  response:: v{'replication,'meta-region-server,'rs,'splitWAL,'backup-masters,'table-lock,'flush-table-proc,'region-in-transition,'online-snapshot,'master,'running,'recovering-regions,'draining,'namespace,'hbaseid,'table} 
2020-04-15 15:22:51.229  INFO 37109 --- [           main] c.p.e.h.service.BulkImportFakenamesData  : now at line [700] rowKey [raglin-judy-k-700]
2020-04-15 15:22:51.271  INFO 37109 --- [           main] c.p.e.h.service.BulkImportFakenamesData  : now at line [800] rowKey [gilley-clifton-m-800]
2020-04-15 15:22:51.334  INFO 37109 --- [           main] c.p.e.h.service.BulkImportFakenamesData  : now at line [900] rowKey [baillie-lucy-b-900]
2020-04-15 15:22:51.440  INFO 37109 --- [           main] c.p.e.h.service.BulkImportFakenamesData  : now at line [1000] rowKey [cruz-charles-a-1000]
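
The row keys in the log above (for example elsass-june-g-100) look like the surname, given name, and middle initial joined with hyphens, lower-cased, and followed by the line number. The sketch below reconstructs that pattern as an assumption drawn only from the log output; the actual code in BulkImportFakenamesData may build keys differently, and the class and method names here are hypothetical.

```java
import java.util.Locale;

// Hypothetical reconstruction of the row-key pattern seen in the log; the repository's code may differ.
class RowKeySketch {
    /** Joins the name parts with '-', lower-cases them, and appends the line number. */
    static String buildRowKey(String surname, String givenName, String middleInitial, long lineNumber) {
        String key = String.join("-", surname.trim(), givenName.trim(), middleInitial.trim());
        return key.toLowerCase(Locale.ROOT) + "-" + lineNumber;
    }

    /** True on every 100th line, matching the spacing of the progress messages in the log. */
    static boolean isProgressLine(long lineNumber) {
        return lineNumber > 0 && lineNumber % 100 == 0;
    }

    public static void main(String[] args) {
        // Name values taken from the log; the capitalisation of the CSV fields is assumed.
        System.out.println(buildRowKey("Elsass", "June", "G", 100));
        System.out.println(isProgressLine(100));
    }
}
```

Appending the line number presumably keeps row keys unique when two people share the same name parts, since writing to an existing row key in HBase adds to or overwrites that row rather than creating a second one.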

5) Once the import has finished, you can inspect the imported data through the hbase shell:

[root@k8s-master1 ~]# hbase shell
hbase(main):001:0> list
6 row(s) in 0.4680 seconds
=> ["ConfigurationManagementGraph", "credio", "custom-table", "fakenames", "user"]
hbase(main):002:0> scan 'fakenames'
ROW                                          COLUMN+CELL
acosta-kacy-p-343                           column=contactinfo:city, timestamp=1586935372171, value=Laurens
acosta-kacy-p-343                           column=contactinfo:country, timestamp=1586935372171, value=US

2. Common hbase shell commands:

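  • hbase shell: starts the interactive command-line shell, through which users can operate on data;

  • list: shows the list of tables in the current HBase database;

  • Create the table user with two column families, info and data: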

    hbase(main):010:0> create 'user', 'info', 'data'
  • Insert data into the user table: for row key rk0001, add a column named name to the column family info with the value zhangsan:

    hbase(main):011:0> put 'user', 'rk0001', 'info:name', 'zhangsan'
  • Query data by row key with the get command:

    # Fetch everything stored under row key rk0001 in the user table (that is, the data of all its cells)
    hbase(main):015:0> get 'user', 'rk0001'
  • Query all data in a table with the scan command:

    hbase(main):032:0> scan 'user'
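
To make the semantics of put, get, and scan concrete, here is a toy in-memory model of a single table, written with plain JDK collections. It is only an illustration of the data model (row key, then family:qualifier, then value, with rows kept sorted by row key); it is not the HBase client API, it ignores timestamps and cell versions, and the class ToyTable and the second row rk0002 are invented for the example.

```java
import java.util.Collections;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of one HBase table: row key -> ("family:qualifier" -> value). Not the HBase client API.
class ToyTable {
    // TreeMap keeps rows ordered by row key, loosely mirroring HBase's sorted row storage.
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    /** Like put 'user', 'rk0001', 'info:name', 'zhangsan': sets one cell, creating the row if needed. */
    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    /** Like get 'user', 'rk0001': returns all cells of one row, or an empty map if the row is absent. */
    Map<String, String> get(String rowKey) {
        NavigableMap<String, String> cells = rows.get(rowKey);
        if (cells == null) {
            return Collections.<String, String>emptyMap();
        }
        return Collections.unmodifiableMap(cells);
    }

    /** Like scan 'user': returns every row, ordered by row key. */
    NavigableMap<String, NavigableMap<String, String>> scan() {
        return Collections.unmodifiableNavigableMap(rows);
    }

    public static void main(String[] args) {
        ToyTable user = new ToyTable();
        user.put("rk0002", "info:name", "lisi");      // hypothetical second row
        user.put("rk0001", "info:name", "zhangsan");  // the row from the shell example
        System.out.println(user.get("rk0001"));
        System.out.println(user.scan().firstKey());
    }
}
```

Although rk0002 is inserted first, the scan starts at rk0001, which is why row-key design matters for how data is read back.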

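Note: the project currently provides only basic operations on an HBase database, plus importing data into HBase from CSV files. More complex queries are not supported yet. Filters, counters, and coprocessors, along with MapReduce integration, are planned for later.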