DS starts in read-only mode when another replica is unreachable
Summary
When, in multi-master replication mode, a single server is unreachable (so the socket connection has to time out), the replication server is unable to accept any connection, because the data server's own socket timeout expires first. This happens during the first DS-RS handshake phase and is accompanied by the following error in the log:
category=SYNC severity=ERROR msgID=178 msg=Directory server 15265 was attempting to connect to replication server 25602 but has disconnected in handshake phase
Steps To Reproduce ("Repro Steps")
- Save the docker-compose.yml included at the end of this issue
- Start both server instances with
docker-compose up -d
- Configure replication:
docker exec -it wrends-test1 \
dsreplication enable --adminUID admin --adminPassword password --trustAll --no-prompt --baseDN dc=example,dc=com \
--host1 wrends-test1 --port1 4444 --bindDN1 "cn=Directory Manager" \
--bindPassword1 password --replicationPort1 8989 \
--host2 wrends-test2 --port2 4444 --bindDN2 "cn=Directory Manager" \
--bindPassword2 password --replicationPort2 8989
docker exec -it wrends-test1 \
dsreplication initialize-all --adminUID admin --adminPassword password --trustAll --no-prompt \
--baseDN dc=example,dc=com --hostname wrends-test1 --port 4444
- Shutdown both servers with
docker-compose stop
- Start only the first server with
docker-compose up -d wrends-test1
- Try to delete an entry with
docker exec -it wrends-test1 ldapdelete -h localhost -p 1389 "uid=user.1,ou=People,dc=example,dc=com"
Expected Result (Behavior You Expected to See)
Server deletes the requested LDAP entry.
Actual Result (Behavior You Saw)
The following error is returned:
Processing DELETE request for uid=user.1,ou=People,dc=example,dc=com
The LDAP delete request failed: 53 (Unwilling to Perform)
Additional Information: The Replication is configured for suffix
dc=example,dc=com but was not able to connect to any Replication Server
Additional Notes
I have spent several hours trying to debug this. The underlying issue is that the replication server's connection listener tries to contact all other replication servers while accepting a new connection. As the other server does not exist, this attempt times out with the same timeout value the data server uses for the connection handshake.
I am not sure why we need to wait for the replication domain to actually contact all the servers. Simply increasing the handshake timeout wouldn't work either: with multiple unreachable servers in the domain, the connection attempts add up, so the accept path can block longer than any single timeout value.
I am creating this issue to track discussion about potential ways of solving this problem.
GitHub won't let me attach a YML file, so here it is:
version: "3.4"
x-shared-config: &shared-config
  extra_hosts:
    - "wrends-test1:10.0.0.31"
    - "wrends-test2:10.0.0.32"
services:
  wrends-test1:
    image: wrensecurity/wrends:5.0.1
    container_name: wrends-test1
    environment:
      ADDITIONAL_SETUP_ARGS: "--sampleData 10"
      ROOT_USER_DN: cn=Directory Manager
      ROOT_USER_PASSWORD: password
    volumes:
      - wrends-data:/opt/wrends/instance
    networks:
      wrenam:
        ipv4_address: 10.0.0.31
    <<: *shared-config
  wrends-test2:
    image: wrensecurity/wrends:5.0.1
    container_name: wrends-test2
    environment:
      ROOT_USER_DN: cn=Directory Manager
      ROOT_USER_PASSWORD: password
    networks:
      wrenam:
        ipv4_address: 10.0.0.32
    <<: *shared-config
volumes:
  wrends-data:
networks:
  wrenam:
    name: wrends-test
    ipam:
      config:
        - subnet: 10.0.0.0/24
What is actually happening under the hood:
- DS tries to connect to RS and performs handshake phase 1.
- RS accepts the connection and tries to perform the handshake as well.
- RS tries to initialize the replication domain parameters.
- RS waits for connections to the other (non-responding) RSs to be established.
- ...in the meantime, DS's handshake attempt times out.
To be honest, I don't understand why DataServerHandler has to wait for the replication server to actually perform the connection check cycle. The only thing that happens from the point of view of DataServerHandler is that, under the hood, ReplicationServer tries to (re)connect to all other RSs; nothing changes in ReplicationDomain's internal state if the connection fails... nothing happens, the call just takes a bit longer. And as a side effect of waiting for the connection check, the original socket times out.
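To illustrate the mechanics, here is a minimal, self-contained Java sketch of the failure mode using plain sockets (this is not Wren:DS code, just the generic pattern): the "RS" side blocks on a simulated peer connection attempt before replying, so the "DS" side's read times out first.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class HandshakeTimeoutDemo {
    // Stands in for the shared socket timeout; in the real issue both sides
    // use the same value, which makes the DS side lose this race in practice.
    static final int TIMEOUT_MS = 2_000;

    public static void main(String[] args) throws Exception {
        try (ServerSocket rs = new ServerSocket(0)) {
            Thread rsSide = new Thread(() -> {
                try (Socket conn = rs.accept()) {
                    // Simulate the RS spending (at least) the shared timeout
                    // trying to reach an unreachable peer RS before replying;
                    // doubled here so the demo's outcome is deterministic.
                    Thread.sleep(2L * TIMEOUT_MS);
                    conn.getOutputStream().write('!'); // handshake reply, too late
                } catch (IOException | InterruptedException ignored) {
                }
            });
            rsSide.start();

            try (Socket ds = new Socket()) {
                ds.connect(new InetSocketAddress("localhost", rs.getLocalPort()));
                ds.setSoTimeout(TIMEOUT_MS); // DS handshake read timeout
                ds.getInputStream().read();  // waits for the handshake reply
            } catch (SocketTimeoutException e) {
                System.out.println("DS side timed out waiting for the handshake reply");
            }
            rsSide.join();
        }
    }
}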
When I drop the requirement for the connection check, everything works as expected. But dropping something that is obviously there for a reason is a pretty big deal :/.
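For completeness, this is roughly the shape of the alternative I experimented with, using purely hypothetical names (not the actual Wren:DS classes): reply to the DS handshake first and push the peer-RS connection attempts off the accept thread, so an unreachable peer can no longer stall the handshake.

import java.util.List;
import java.util.concurrent.CompletableFuture;

class AcceptPathSketch {
    // Hypothetical abstraction over a blocking connection attempt to a peer RS.
    interface PeerConnector {
        void connect(String peer);
    }

    static void onDataServerAccepted(Runnable answerHandshake,
                                     List<String> peerReplicationServers,
                                     PeerConnector connector) {
        // Answer the DS handshake immediately instead of blocking on peers.
        answerHandshake.run();
        // Then try to (re)connect to the other RSs in the background; a failed
        // attempt changes nothing in the domain state, as observed above.
        for (String peer : peerReplicationServers) {
            CompletableFuture.runAsync(() -> connector.connect(peer));
        }
    }
}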
Digging through commit history...
Based on 51ef33b, it seems that the wait was always mandatory when creating a new ReplicationDomain in the "old days". That commit made the wait mandatory only for DS connections.
The commit message for this change, 4d90aff, could be the title of this issue:
Fix issue OpenDJ-96: Replication server monitor data computation takes too long / blocks rest of server when another RS is cannot be reached
Still, I am none the wiser as to why the wait is there.
Ok, even though the timeout is very strangely implemented, the server is behaving as intended. With isolation-policy set to accept-all-updates, the server starts with writes enabled.
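For reference, switching the policy with dsconfig should look roughly like this; I haven't double-checked the exact invocation, and the domain name may differ per deployment (dsconfig list-replication-domains shows it):
docker exec -it wrends-test1 \
dsconfig set-replication-domain-prop \
--provider-name "Multimaster Synchronization" \
--domain-name "dc=example,dc=com" \
--set isolation-policy:accept-all-updates \
--hostname localhost --port 4444 \
--bindDN "cn=Directory Manager" --bindPassword password \
--trustAll --no-prompt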