couchbase/sync_gateway

Sync gateway seems to not handle graceful failover correctly

olivierboudet opened this issue · 10 comments

Sync Gateway version

docker image 2.1.2-community

Operating system

Linux Ubuntu 18.04

Config file

{
	"adminInterface":"0.0.0.0:4985",
	"interface":"0.0.0.0:4984",
	"databases": {
		"structure": {
			"unsupported": {
				"user_views": {
				  "enabled": true
				}
			},
			"server": "http://couchbase-service:8091",
			"allow_empty_password": false,
			"bucket": "structure",
			"username": "syncgateway",
			"password": "syncgateway",
			"use_views": true,
			"sync": "function(doc, oldDoc) {channel(doc.model+'_'+doc.identifiantSite)};",
			"users": {
				"GUEST": {
					"disabled": true,
					"admin_channels": [
						"*"
					],
					"all_channels": [
						"*"
					]
				}
			}
		}
	}
}
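To make the channel routing in this config concrete, here is a minimal sketch that runs the same sync function outside Sync Gateway. The `channel` stub and the `channelsFor` helper are hypothetical stand-ins written for illustration, not Sync Gateway APIs. The sample document fields are an assumption, inferred from the channel name STRUCTURE_22RW7 that appears in the warning quoted under "Actual behavior".

```javascript
// Sync function body copied from the "sync" property of the config.
const syncSource =
  "function(doc, oldDoc) {channel(doc.model+'_'+doc.identifiantSite)}";

// Hypothetical helper: evaluates the sync function with a stub channel()
// that only records the channel names it is called with.
function channelsFor(doc, oldDoc) {
  const assigned = [];
  const channel = (name) => {
    assigned.push(name);
  };
  // Inject the stub as the `channel` identifier seen by the sync function.
  const sync = new Function("channel", "return " + syncSource)(channel);
  sync(doc, oldDoc);
  return assigned;
}

// Assumed document shape (not taken from the reporter's data).
const exampleDoc = { model: "STRUCTURE", identifiantSite: "22RW7" };
console.log(JSON.stringify(channelsFor(exampleDoc, null)));
// prints ["STRUCTURE_22RW7"]
```

Under this assumption, each document lands in exactly one channel named after its model and site identifier, which matches the shape of the channel name in the failing changes feed.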

couchbase-service is a k8s service with these endpoints:

kprod describe endpoints couchbase-service
Name:         couchbase-service
Namespace:    default
Labels:       app=couchbase-worker-pod
              chart=couchbase-1.0.0
              heritage=Tiller
              release=couchbase
Annotations:  <none>
Subsets:
  Addresses:          10.32.2.189,10.32.7.13,10.32.8.17
  NotReadyAddresses:  <none>
  Ports:
    Name        Port  Protocol
    ----        ----  --------
    http-views  8092  TCP
    http        8091  TCP

Events:  <none>

Log output

https://gist.github.com/olivierboudet/926f2407d42d42a6e8f6e86f7c3be12b

Expected behavior

During a graceful failover of a node, data must remain available to users via Sync Gateway.

Actual behavior

In a 3-node cluster with one instance of SG, users cannot fetch data from Sync Gateway while a graceful failover of a node is in progress. In the logs I can see multiple occurrences of this line:

"2019-08-30T09:50:50.397Z [WRN] MultiChangesFeed got error reading changes feed \"STRUCTURE_22RW7\": unauthorized - password required -- db.(*Database).SimpleMultiChangesFeed.func1() at changes.go:493"

This error disappears as soon as the failover is done, and users can fetch data again.
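To measure how long the changes feed was failing, the warning lines can be pulled out of a Sync Gateway log and summarized. The following is a minimal sketch, assuming the log lines have the same shape as the one quoted above (with or without the surrounding and escaped quotes). `summarizeFailures` is a hypothetical helper name, not part of any Sync Gateway tooling.

```javascript
// Matches the warning quoted in this report. Group 1 is the timestamp,
// group 2 is the channel name; escaped quotes (\") are tolerated.
const WARNING =
  /^"?(\S+) \[WRN\] MultiChangesFeed got error reading changes feed \\?"([^"\\]+)\\?": unauthorized - password required/;

// Hypothetical helper: returns the count, first and last timestamp, and the
// affected channels, or null when no line matches.
function summarizeFailures(lines) {
  let count = 0;
  let first = null;
  let last = null;
  const channels = new Set();
  for (const line of lines) {
    const m = WARNING.exec(line);
    if (!m) continue;
    const at = new Date(m[1]).getTime();
    if (Number.isNaN(at)) continue; // skip lines with an unparseable timestamp
    count += 1;
    if (first === null || at < first) first = at;
    if (last === null || at > last) last = at;
    channels.add(m[2]);
  }
  if (count === 0) return null;
  return {
    count,
    first: new Date(first).toISOString(),
    last: new Date(last).toISOString(),
    channels: [...channels],
  };
}

// Usage with the line from this report.
const sample = String.raw`"2019-08-30T09:50:50.397Z [WRN] MultiChangesFeed got error reading changes feed \"STRUCTURE_22RW7\": unauthorized - password required -- db.(*Database).SimpleMultiChangesFeed.func1() at changes.go:493"`;
console.log(summarizeFailures([sample]));
```

Comparing the `first` and `last` timestamps against the start and end of the failover shows whether the errors line up with the failover window.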

bbrks commented

Hi @olivierboudet,

Thanks for the detailed bug report.

I've filed a Jira issue over at https://issues.couchbase.com/browse/CBG-500 to track this.

@olivierboudet Is it possible to collect and share Couchbase Server logs for the time you saw this issue? That would help identify the root cause of this issue.

@adamcfraser I tried to collect all the useful logs from when the issue happened.

FYI, I upgraded the three nodes one by one on August 29th and 30th (so there are some "Some nodes didn't respond" entries in stats.log, which correspond to the upgrade time of each node).
The nodes were upgraded in order (node1, node2, then node3 in the attached archive), and the issue described here was observed during the graceful failover of node3 (started around 2019-08-30 09:10:00).

couchbase.tar.gz

@olivierboudet Thanks - I had an initial look at the logs but haven't identified anything related to the failure on the SG side yet. Getting a full cbcollect might provide additional information - can you generate that and share? Full Sync Gateway logs would also be helpful for additional context.

@adamcfraser Thanks for investigating this.
I collected all the logs from the cluster and the Sync Gateway, but I would prefer not to share them publicly, as some private information appears in the Sync Gateway logs (usernames, document IDs, etc.).

Is there a way I can send you these logs privately, or at least send you a password for retrieving them from a server (by email or some other means)?

Thanks

bbrks commented

@olivierboudet You can email me at ben.brooks@couchbase.com with either a link to where they're available, or I can provide information on how to upload this data to Couchbase directly.

Thanks!

This is potentially a Go SDK issue - I have filed https://issues.couchbase.com/browse/GOCBC-591 for further investigation.

Having said that, in general GSI is going to be a better choice for graceful failover, as views need to be completely rebuilt for vbuckets on the node that's failed over. You may want to consider using GSI as an alternative for the time being.

@adamcfraser Thank you for the investigation.

I had already considered moving to GSI, but I have a question. With the Community Edition, GSI index replication is not available. How is graceful failover handled in that case? Without index replication, I don't see the benefit of using GSI for failover and high availability.

@olivierboudet The SDK team hasn't been able to reproduce the issue you're seeing. If you can reliably reproduce this issue, providing a packet capture of traffic between Sync Gateway and Couchbase Server may help us reproduce and get to the bottom of it.

The SDK team has identified the underlying issue, and their fix has been picked up on master (will be available in the next release).

gocb fix picked up along with manifest changes in #4247