batch_get not giving correct result

Question

deenbandhu-agarwal opened this issue 7 years ago · 24 comments

deenbandhu-agarwal commented 7 years ago

Hi

I am using batch_get in production.
Sometimes batch_get does not return results for certain keys, while at the same time get returns results for the same keys. I made sure that no migration was going on when this issue occurred.

Answer 1 · 2018-02-06T11:04:12.000Z

Just to clarify, are you saying that the client always returns no results when you call batch_get for specific keys?

Can you turn on debug logging for the client and check if there are any messages getting logged when you execute the batch_get command?

Answer 2 · 2018-02-06T12:29:06.000Z

No, it is not for any specific keys. Yes, I will turn on debug logging and let you know.

Answer 3 · 2018-02-09T03:12:55.000Z

@deenbandhu-agarwal, did you manage to capture some debug logs?

Another question: When you call Client#batch_get, how many entries does the resulting record array contain? As many entries as keys you passed in (but with some entries being nil)? Or does the resulting array have fewer entries than the number of keys?
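
One way to answer this question from application code is to pair each requested key with the entry at the same position in the result and collect the keys whose entry is nil. The following is only a sketch: `missing_keys` is a hypothetical helper, not part of the client, and it works on plain arrays so it can be tried without a cluster. With a real client, `records` would be whatever batch_get returned.

```ruby
# Hypothetical helper (not part of the Aerospike client).
# Pairs each requested key with the result at the same index and
# returns the keys whose result slot is nil.
def missing_keys(keys, records)
  # A length mismatch is the second case asked about above: the result
  # has fewer (or more) entries than the number of keys passed in.
  unless keys.length == records.length
    raise ArgumentError, "expected #{keys.length} records, got #{records.length}"
  end

  keys.zip(records).select { |_key, record| record.nil? }.map(&:first)
end

# Stand-in data; real records would come from the client.
keys    = %w[user:1 user:2 user:3]
records = [{ 'name' => 'a' }, nil, { 'name' => 'c' }]

puts missing_keys(keys, records).inspect
```

Logging the returned keys together with the node each request was sent to makes it possible to tell apart "record does not exist" from "request went to the wrong node".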

Answer 4 · 2018-02-20T13:16:21.000Z

Hi
Sorry for the late reply. I was debugging the issue and made the following observations.

I am receiving as many entries in the result as keys I passed in.
The error occurs only for a very short time and only on one particular client (not on all the clients that are accessing Aerospike at that time). I logged the node IP (the Aerospike node to which the request was sent) and the key for which batch_get failed. When I checked which node is responsible for that key, it turned out to be a different one.
So based on the above observations, it looks like the client has a wrong mapping of key to node IP, and this is probably why batch_get fails while get gives me the result.
I think these observations will be helpful for you in debugging the issue.

Answer 5 · 2018-02-21T06:55:34.000Z

@deenbandhu-agarwal,

It sounds like the client's partition map is not up-to-date. That would cause the client to send the batch requests for some of the keys to the wrong cluster node. And since the Ruby client uses the older Batch Direct protocol, the request will not be proxied to the correct master node.

But you also said that at the time there were no ongoing migrations. So if one or more clients' partition maps were outdated, that in itself would be a problem.

Is it always the same client that has this problem? Does restarting the client help?
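
The stale-partition-map scenario described in this answer can be illustrated with a toy model. Everything here is an assumption made for illustration: the partition map is a plain Hash of partition id to node name, partition ids are given directly instead of being derived from a key digest, and `group_by_node` and `misrouted` are hypothetical helpers rather than client methods.

```ruby
# Toy model: a partition map is a Hash of partition id => node name.
# A real client derives the partition id from the key's digest; here
# the ids are passed in directly to keep the sketch self-contained.

# Splits a batch into per-node sub-batches according to the given map.
def group_by_node(partition_ids, partition_map)
  partition_ids.group_by { |pid| partition_map.fetch(pid) }
end

# Partition ids that a client holding `stale_map` would send to a node
# that, according to `current_map`, does not own them.
def misrouted(partition_ids, stale_map, current_map)
  partition_ids.reject { |pid| stale_map.fetch(pid) == current_map.fetch(pid) }
end

current_map = { 0 => 'node-a', 1 => 'node-b', 2 => 'node-c' }
stale_map   = { 0 => 'node-a', 1 => 'node-c', 2 => 'node-c' } # partition 1 is outdated

puts group_by_node([0, 1, 2], stale_map).inspect
puts misrouted([0, 1, 2], stale_map, current_map).inspect
```

In this model the request for partition 1 goes to node-c although node-b owns it. If the receiving node does not proxy the request, as described above for the Batch Direct protocol, that entry would come back empty even though the record exists.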

Answer 6 · 2018-02-21T07:03:21.000Z

@jhecking
No, the issue occurs on random clients. It is probably resolved by a restart, but our clients restart automatically after 10-15 minutes anyway, so it is difficult for us to test the restart scenario in isolation. Still, I think a restart will solve the issue.

Answer 7 · 2018-02-21T07:04:54.000Z

@jhecking
I also have a question: what could be the reason for an outdated partition map if there is no migration?

Answer 8 · 2018-02-21T07:09:48.000Z

@jhecking
Why are we not thinking in the direction of implementing the Batch Index protocol?

Answer 9 · 2018-02-21T07:43:23.000Z

I also have a question what could be the reason for outdated partition map if there is no migration ?

The partition map changes as cluster nodes get added or removed from the cluster. Usually, the client should update the partition map from each cluster node about once a second. If that does not happen, that would be considered a bug. I assume you have already checked the client logs for any error messages?

Why we are not thinking in the direction of implementing batch index protocol?

Yes, that would definitely be a good solution. (Besides fixing any bugs in the partition update logic, if there are indeed any, of course.) But I'm currently tied up with other projects and am not planning to do any significant feature updates to the Ruby client in the near future. PRs are of course always welcome!

Answer 10 · 2018-02-21T07:49:07.000Z

Also, possibly unrelated, but why are you restarting your clients automatically every 10-15 minutes?

Answer 11 · 2018-02-21T09:58:05.000Z

Also, possibly unrelated, but why are you restarting your clients automatically every 10-15 minutes?

This is because of an internal issue on our side: we restart the complete app, so the client gets restarted automatically.

Usually, the client should update the partition map from each cluster node about once a second

Can you tell me where exactly in the code the map gets updated, and how? Then I can probably look into the code and find the bug.

Answer 12 · 2018-02-21T10:16:03.000Z

The partition maps for all cluster nodes get updated in the Cluster#tend method roughly every second (depending on the configured tend interval). During cluster tend, Node#refresh gets called for every cluster node. This method fetches some information from the cluster node, including the node's current partition generation. If any node's generation number has changed, the client updates the cluster's partition map in Cluster#update_partitions.

Btw, what server version are you running? Community or Enterprise edition? And how many nodes do you have in your cluster?
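
The tend cycle described in this answer can be sketched roughly as follows. This is a toy model, not the client's code: `ToyNode` and `ToyCluster` are hypothetical stand-ins, the generation number is set by hand instead of being fetched from a server, and the real Cluster#tend, Node#refresh and Cluster#update_partitions do considerably more.

```ruby
# Hypothetical stand-ins; these are NOT the client's real classes.
ToyNode = Struct.new(:name, :server_generation, :seen_generation, :partitions) do
  # Mimics a refresh: reports whether the partition generation changed
  # since the last tend and remembers the new value.
  def refresh
    changed = server_generation != seen_generation
    self.seen_generation = server_generation
    changed
  end
end

class ToyCluster
  attr_reader :partition_map

  def initialize(nodes)
    @nodes = nodes
    @partition_map = {}
  end

  # One tend pass: refresh every node, then re-read the partitions of
  # each node whose generation moved. Returns the names of those nodes.
  def tend
    changed = @nodes.select(&:refresh)
    changed.each { |node| update_partitions(node) }
    changed.map(&:name)
  end

  private

  def update_partitions(node)
    node.partitions.each { |pid| @partition_map[pid] = node.name }
  end
end

a = ToyNode.new('node-a', 1, nil, [0, 1])
b = ToyNode.new('node-b', 1, nil, [2])
c = ToyNode.new('node-c', 1, nil, [3])
cluster = ToyCluster.new([a, b, c])
cluster.tend # first pass: every node is new, so the whole map is built

# Simulate partition 1 moving from node-a to node-b.
a.partitions = [0]
a.server_generation = 2
b.partitions = [1, 2]
b.server_generation = 2

puts cluster.tend.inspect      # node-c is skipped, its generation is unchanged
puts cluster.partition_map[1]
```

In this model, a client whose tend pass is delayed or fails would keep routing partition 1 to node-a, which is the kind of stale mapping suspected earlier in the thread.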

Answer 13 · 2018-02-21T12:40:55.000Z

@jhecking
I am using the Community edition and I have 10 nodes in the cluster.

Answer 14 · 2018-02-21T13:20:04.000Z

@jhecking I am not able to enable info-level logging only: once I enable logs, I also get debug-level messages. I don't want to enable debug-level logging as it will affect performance. Can you please look into this issue?

Answer 15 · 2018-02-21T13:40:49.000Z

I am not able to enable to info level log only once i enable logs it is giving me debug level log also.

Seems to work for me - how are you setting the log level?

$ bundle exec irb
2.3.1 :001 > require 'aerospike'
 => true
2.3.1 :002 > Aerospike.logger.level = Logger::INFO
 => 1
2.3.1 :003 > client = Aerospike::Client.new
I, [2018-02-21T21:37:59.030165 #66914]  INFO -- : No connections available; seeding...
...
I, [2018-02-21T21:37:59.043842 #66914]  INFO -- : New cluster initialized and ready to be used...
 => #<Aerospike::Client:0x007fb64c9404a8 ...>
2.3.1 :004 > client.create_index('test', 'test', 'test_idx', 's', :string)
 => #<Aerospike::IndexTask:0x007fb64c8c1310 ...>
2.3.1 :005 > Aerospike.logger.level = Logger::DEBUG
 => 0
2.3.1 :006 > client.create_index('test', 'test', 'test_idx', 's', :string)
D, [2018-02-21T21:38:19.381790 #66914] DEBUG -- : Sending info command: sindex-create:ns=test;set=test;indexname=test_idx;numbins=1;indexdata=s,STRING;priority=normal
 => #<Aerospike::IndexTask:0x007fb64c868990 ...>
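
The same filtering can be reproduced without a cluster using only Ruby's standard Logger. This sketch assumes that the client's logger behaves like a standard Logger instance, which the session above suggests but which is not verified here; the log messages are made up for the example.

```ruby
require 'logger'
require 'stringio'

# Log into a string buffer so the result can be inspected afterwards.
io = StringIO.new
logger = Logger.new(io)

logger.level = Logger::INFO
logger.debug('partition generation unchanged') # below INFO: dropped
logger.info('New cluster initialized')         # written

logger.level = Logger::DEBUG
logger.debug('Sending info command')           # written now

puts io.string
```

If debug messages still appear at INFO level in an application, a likely place to look is code that sets the level again later, or a second Logger instance with its own level.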

Answer 16 · 2018-03-01T05:12:46.000Z

Hi @deenbandhu-agarwal, have you had any luck in finding out more about this problem?

Answer 17 · 2018-03-15T07:43:23.000Z

@jhecking
Hi, I got busy with something else.
I was not able to find the root cause of the issue, but I am thinking of implementing the Batch Index protocol. However, I want to get my changes merged into version 2.1.1, as we are using that version in production. Will you be able to do that?

Answer 18 · 2018-03-15T13:50:25.000Z

@deenbandhu-agarwal, I am curious: What is stopping you from upgrading to v2.5? Technically, I can see only two potential issues:

If you are using an older Ruby version: Support for Ruby v2.0 and v2.1 has since been dropped from the client and support for Ruby v2.2 is likely going to be dropped with the next release.
If you are still using LDT - support for this feature has been dropped in v2.5.

It would be quite unusual to back-port a significant feature like Batch Index support to an 18-month-old minor version.

Answer 19 · 2018-03-16T04:29:17.000Z

Yeah, actually the reason is that I am using Ruby v2.1.

Answer 20 · 2018-03-16T07:49:21.000Z

I see. But support for Ruby v2.1 was only dropped in v2.5 of the client. And even that client version should still be fully functional on Ruby v2.1. It's just that we have removed Ruby v2.1 from the CI support matrix. But there were no actual changes in client v2.5 that would break compatibility with this Ruby version. I just tried running the specs on master against Ruby v2.1 and even v2.0 and they pass without problems. So I would still recommend you try upgrading to the latest client version or at least v2.4. You should also consider upgrading to a newer Ruby version - even the Ruby core team stopped supporting Ruby v2.1 almost a year ago.

Answer 21 · 2018-03-21T07:43:22.000Z

@jhecking, can you please give me push rights? I am not able to push the branch.

Answer 22 · 2018-03-21T08:32:00.000Z

@deenbandhu-agarwal, in order to create a pull request to merge your changes into this repo, please first create a fork of this repo, then commit your changes to a new branch on your fork, and then, finally, create the pull request from your fork.

Or see this tutorial about how to make your first contribution via GitHub pull requests.

Answer 23 · 2018-03-21T09:34:11.000Z

@jhecking
I have created pull request #61.
Please review and merge it.

Thanks

Answer 24 · 2018-04-11T06:38:33.000Z

PR #61 has been merged. I am going to close this issue now.