OpenTSDB/asynchbase

HBaseRpc errors after an HBase drive crash.

jonbonazza opened this issue · 1 comment

This isn't really an issue with asynchbase, but more of a request for support with a problem we are seeing in our dev environment when using this library. I apologize if this isn't the best place to post it; if it isn't, please feel free to close it and point me in the right direction. Thanks.

Anyway, we experienced a drive crash on one of our HBase nodes, and since then we started to get some expected timeout errors and such. Nothing out of the ordinary. Once we recovered the drives, however, we restarted our HBase clients, and now we see the following errors any time we try to access HBase:
ERROR [2016-06-23 17:31:56,779] org.hbase.async.HBaseRpc: Receieved a timeout handle HashedWheelTimeout(deadline: 3979557 ns ago, task: org.hbase.async.HBaseRpc$TimeoutTask@1a92f792) that doesn't match our own org.hbase.async.HBaseRpc$TimeoutTask@1a92f792
ERROR [2016-06-23 17:31:56,780] org.hbase.async.RegionClient: Removed the wrong RPC null when we meant to remove Exists(table=, key=, family=null, qualifiers=null, attempt=12, region=RegionInfo(table=, region_name=",,1456448830516.172da4965e60d3186998c6e07af2f6c0.", stop_key=""))
WARN [2016-06-23 17:31:56,791] org.hbase.async.HBaseClient: Probe Exists(table=, key=, family=null, qualifiers=null, attempt=0, region=RegionInfo(table=, region_name=",,1456448830516.172da4965e60d3186998c6e07af2f6c0.", stop_key="")) failed
! org.hbase.async.RpcTimedOutException: RPC ID [12] timed out waiting for response from HBase on region client [RegionClient@1325442575(chan=[id: 0x90e43262, / => ], #pending_rpcs=0, #batched=0, #rpcs_inflight=0) ] for over 15000ms
! at org.hbase.async.HBaseRpc$TimeoutTask.run(HBaseRpc.java:618) [metrics-drop.jar:3.0.1]
! at org.jboss.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:556) [metrics-drop.jar:3.0.1]
! at org.jboss.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:632) [metrics-drop.jar:3.0.1]
! at org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:369) [metrics-drop.jar:3.0.1]
! at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [metrics-drop.jar:3.0.1]
! at java.lang.Thread.run(Thread.java:745) [na:1.8.0_92]
...
...
...

Does this mean our data is corrupted?

Is there some way to recover from this?

Luckily this occurred in our dev environment and not in production, so we have a little more liberty to play with the data, but we'd like to understand what, exactly, is going on in case this ever does occur in production.

Thanks in advance.

Hm, no, this particular exception just means that the HBase server wasn't answering RPCs quickly enough. The errors should disappear once HBase is running in a healthy state again, but if you keep seeing the timeouts, check whether RPCs are sitting in the HBase queues for a long time (look at process call time, etc.).
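If the timeouts persist for a while after HBase recovers, one client-side knob worth checking is the per-RPC timeout itself. Below is a minimal sketch, assuming a recent asynchbase release where org.hbase.async.Config and the hbase.rpc.timeout property are available; the ZooKeeper quorum, table, and key are placeholders, not values from this issue. It raises the RPC timeout and attaches an errback so an RpcTimedOutException is surfaced to the application instead of only showing up in the logs:

```java
import java.util.ArrayList;

import org.hbase.async.Config;
import org.hbase.async.GetRequest;
import org.hbase.async.HBaseClient;
import org.hbase.async.KeyValue;
import org.hbase.async.RpcTimedOutException;

import com.stumbleupon.async.Callback;
import com.stumbleupon.async.Deferred;

public class TimeoutCheck {
  public static void main(final String[] args) throws Exception {
    // Raise the per-RPC timeout (milliseconds) so a slow but recovering
    // region server gets more time to answer before the client gives up.
    final Config config = new Config();
    config.overrideConfig("hbase.zookeeper.quorum", "zk-host:2181");  // placeholder quorum
    config.overrideConfig("hbase.rpc.timeout", "60000");

    final HBaseClient client = new HBaseClient(config);
    try {
      final GetRequest get = new GetRequest("mytable", "mykey");  // placeholder table/key

      // The errback fires for RpcTimedOutException (and any other failure),
      // so the application can log or retry instead of silently timing out.
      final Deferred<Object> d = client.get(get)
          .addErrback(new Callback<Object, Exception>() {
            public Object call(final Exception e) {
              if (e instanceof RpcTimedOutException) {
                System.err.println("RPC timed out; HBase is likely still slow: " + e);
              } else {
                System.err.println("RPC failed: " + e);
              }
              return null;  // mark the error as handled
            }
          });

      // Wait a bit longer than the RPC timeout; returns null if the errback ran.
      System.out.println("Result: " + d.join(65000));
    } finally {
      client.shutdown().join();
    }
  }
}
```

If raising the timeout makes the errors go away once the cluster is healthy, that supports the reading above: the region servers were just slow to respond, not that the data on disk is corrupted.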