basho/riak-dotnet-client

Nodes are not marked Offline for a failed GET request

Closed this issue · 10 comments

When a node goes offline, the Riak Client should mark the node as unavailable and then not hit this node until the Node Monitor detects the node is available again. This is working correctly for some commands (like Execute), and failing for other commands (like GET).

To reproduce the issue:

Configure the app.config to reference a single node (not required, but makes debugging a lot easier)

  1. Disable the Node Monitor
  2. Issue a GET command
  3. Kill the connection to the server (I use wkillcx.exe; our firewall also kills the connection after several minutes of inactivity)
  4. Issue a GET command

Repeat using an EXECUTE command.

Step 4 will fail for an EXECUTE and will succeed for a GET. The correct behavior is to fail, because the node should have been marked as Offline.

When the Node Monitor thread is running the node will re-appear after 5 seconds.

The problem is being caused by the NodeOffline state being lost during the return. Pull request #303 has been submitted in an effort to address this issue.
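To illustrate what "lost during the return" can look like, here is a minimal sketch in Python. It is not the client's actual code: the `Result` type, its `node_offline` field, and the three functions are hypothetical names chosen for illustration. It only shows the general pattern where a wrapper rebuilds a result object and a failure flag silently falls back to its default.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical result type; node_offline defaults to False."""
    is_success: bool
    node_offline: bool = False
    value: object = None


def low_level_call() -> Result:
    # Pretend the connection was killed: the call fails and flags the node.
    return Result(is_success=False, node_offline=True)


def get_lossy() -> Result:
    inner = low_level_call()
    # Bug pattern: a new Result is built without copying node_offline,
    # so the caller never learns that the node should be marked offline.
    return Result(is_success=inner.is_success, value=inner.value)


def get_preserving() -> Result:
    inner = low_level_call()
    # Fix pattern: carry the flag through to the caller.
    return Result(
        is_success=inner.is_success,
        node_offline=inner.node_offline,
        value=inner.value,
    )
```

In this model, a caller that decides whether to deactivate a node based on `node_offline` would never deactivate it after `get_lossy()`, which matches the symptom described for GET.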

Thanks, we'll get to this in the next .NET client release.

I believe this is a duplicate / same issue as #296

Hi @rob-somerville - to be honest, I can't reproduce this issue. I modified my app.config to use only one node, with a pool size of 1 and upped the nodePollTime to 5000000. I then hard-coded NodeOffline to false (to undo your change). When I run riak stop on the one node I'm hitting, subsequent Get() and FetchServerInfo calls don't succeed at all, and don't try other nodes (obviously, since they're not configured).

I am assuming that putting this change into your environment fixes the issue you are seeing?

I'll keep trying to reproduce. I like the change and have a few modifications I'm making. I added the ChaosMonkeyApp which I run against a cluster, then run the chaos-monkey script to take down nodes while the console app is running.

Hey Luke,

I’m working on doing some more testing too. I’ll let you know what we see.

We were only testing what happens when the connection is closed (happens to us a lot because of an internal firewall rule), and not when a node goes off-line. We’ve been running more tests today for a node going off-line – should have some results very shortly.

Many thanks, appreciate you looking at it.

Rob

From: Luke Bakken [mailto:notifications@github.com]
Sent: Friday, April 8, 2016 1:33 PM
To: basho/riak-dotnet-client riak-dotnet-client@noreply.github.com
Cc: Rob Somerville rob.somerville@mindbodyonline.com
Subject: Re: [basho/riak-dotnet-client] Nodes are not marked Offline for a failed GET request (#304)

Thanks for that info Rob, I'll look into reproducing that specific condition.

@rob-somerville - with that info you provided, I used the tcpkill program to kill connections to the Riak node, while having one node configured, a retry count of one, and a pool size of one. Without your change, I could reproduce the behavior you describe, and with your change, the client correctly gets this error:

Unable to access functioning Riak node

Hey Luke,

I don’t have the world’s most enviable test environment for development, but here’s what we have and what we’re seeing:

We have a single node in development. To simulate multiple nodes, we’ve set up two entries in our hosts file and two nodes in our config file. To simulate a failure, we remove one of the nodes from the hosts file and then kill the connection to that node. The test we are running is just a loop that calls GET or EXECUTE once a second. We’ve bumped up NodePollTime to 50000 and defaultRetryWaitTime to 2000 just to make it easier to see what’s happening.

For the EXECUTE call, when the node is removed, we see a 2-second pause and then execution resumes. Stepping through the code, the first call to the offline node fails, there is a timeout, and then the second call succeeds. Subsequent calls are directed only to the remaining online node.

For the GET call, we also see a 2-second pause when the node goes offline, and then execution resumes with a 2-second delay on each subsequent call. Stepping through the code, the first call to the offline node fails, there is a timeout, and the second call succeeds. However, the node has not been marked offline, so we continue to see the 2-second delay until the node comes back online.
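The difference between the two behaviors can be modeled with a small simulation. This is a simplified sketch in Python, not the client's implementation: the `Node` class, `run_calls`, the strict round-robin selection, and the `retries` default are all assumptions made for illustration. The wait is accumulated as a number rather than actually slept, and `RETRY_WAIT_MS` mirrors the defaultRetryWaitTime of 2000 used in the test.

```python
from dataclasses import dataclass

RETRY_WAIT_MS = 2000  # mirrors the defaultRetryWaitTime value from the test


@dataclass
class Node:
    name: str
    reachable: bool = True
    marked_offline: bool = False


def run_calls(nodes, calls, mark_offline_on_failure, retries=3):
    """Return the simulated wait in ms paid by each call (hypothetical model)."""
    waits = []
    cursor = 0  # round-robin position; an assumption about node selection
    for _ in range(calls):
        waited = 0
        for _attempt in range(retries + 1):
            candidates = [n for n in nodes if not n.marked_offline]
            if not candidates:
                raise RuntimeError("no usable node")
            node = candidates[cursor % len(candidates)]
            cursor += 1
            if node.reachable:
                break  # call succeeded on this node
            if mark_offline_on_failure:
                node.marked_offline = True  # skip this node on later calls
            waited += RETRY_WAIT_MS  # wait before retrying on the next node
        else:
            raise RuntimeError("retries exhausted")
        waits.append(waited)
    return waits


def make_nodes():
    # One node whose connection was killed, one healthy node.
    return [Node("a", reachable=False), Node("b")]


execute_like = run_calls(make_nodes(), 4, mark_offline_on_failure=True)
get_like = run_calls(make_nodes(), 4, mark_offline_on_failure=False)
```

Under these assumptions, `execute_like` pays the retry wait only on the first call, while `get_like` pays it on every call because the unreachable node keeps being selected.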

Seems to be ok after the change.

Going to get Engineering to open the ports to our Riak servers so that our applications don’t need to run through the load balancer.

Thanks,

Rob

@rob-somerville - could you confirm that the change in PR #305 addresses the issue using your test? I'll do the same test in my environment. Thanks!

Hey Luke,

I’m going to run a few more tests this afternoon, but so far it’s passing and addressing the issues from earlier.

Many thanks,

Rob

Great, I appreciate the update.