DS starts in read-only mode when another replica is unreachable
Summary
When, in multi-master replication mode, a single server is unreachable (so the socket connection has to time out), the replication server is unable to accept any connection, because the data server's own socket timeout expires first. This happens during the first DS-RS handshake phase and is accompanied by the following error in the log:
category=SYNC severity=ERROR msgID=178 msg=Directory server 15265 was attempting to connect to replication server 25602 but has disconnected in handshake phase
Steps To Reproduce ("Repro Steps")
- Save the docker-compose.yml included at the end of this issue
- Start both server instances with
docker-compose up -d
- Configure replication:
docker exec -it wrends-test1 \
dsreplication enable --adminUID admin --adminPassword password --trustAll --no-prompt --baseDN dc=example,dc=com \
--host1 wrends-test1 --port1 4444 --bindDN1 "cn=Directory Manager" \
--bindPassword1 password --replicationPort1 8989 \
--host2 wrends-test2 --port2 4444 --bindDN2 "cn=Directory Manager" \
--bindPassword2 password --replicationPort2 8989
docker exec -it wrends-test1 \
dsreplication initialize-all --adminUID admin --adminPassword password --trustAll --no-prompt \
--baseDN dc=example,dc=com --hostname wrends-test1 --port 4444
- Shutdown both servers with
docker-compose stop
- Start only the first server with
docker-compose up -d wrends-test1
- Try to delete an entry with
docker exec -it wrends-test1 ldapdelete -h localhost -p 1389 "uid=user.1,ou=People,dc=example,dc=com"
Expected Result (Behavior You Expected to See)
Server deletes the requested LDAP entry.
Actual Result (Behavior You Saw)
The following error is returned:
Processing DELETE request for uid=user.1,ou=People,dc=example,dc=com
The LDAP delete request failed: 53 (Unwilling to Perform)
Additional Information: The Replication is configured for suffix
dc=example,dc=com but was not able to connect to any Replication Server
Additional Notes
I have spent several hours trying to debug this. The underlying issue is that the replication server's connection listener tries to contact all other replication servers while accepting a new connection. As the other server does not exist, this attempt times out with the same timeout value the data server uses for the connection handshake.
I am not sure why we need to wait for the replication domain to actually contact all the servers. Simply increasing the handshake timeout wouldn't work either: with multiple unreachable servers in the domain, the connection attempts add up, so the accept path can block longer than any single timeout value.
I am creating this issue to track discussion about potential ways of solving this problem.
GitHub won't let me attach a YML file, so here it is:
version: "3.4"
x-shared-config: &shared-config
  extra_hosts:
    - "wrends-test1:10.0.0.31"
    - "wrends-test2:10.0.0.32"
services:
  wrends-test1:
    image: wrensecurity/wrends:5.0.1
    container_name: wrends-test1
    environment:
      ADDITIONAL_SETUP_ARGS: "--sampleData 10"
      ROOT_USER_DN: cn=Directory Manager
      ROOT_USER_PASSWORD: password
    volumes:
      - wrends-data:/opt/wrends/instance
    networks:
      wrenam:
        ipv4_address: 10.0.0.31
    <<: *shared-config
  wrends-test2:
    image: wrensecurity/wrends:5.0.1
    container_name: wrends-test2
    environment:
      ROOT_USER_DN: cn=Directory Manager
      ROOT_USER_PASSWORD: password
    networks:
      wrenam:
        ipv4_address: 10.0.0.32
    <<: *shared-config
volumes:
  wrends-data:
networks:
  wrenam:
    name: wrends-test
    ipam:
      config:
        - subnet: 10.0.0.0/24
What is actually happening under the hood:
- DS tries to connect to RS and performs handshake phase 1.
- RS accepts the connection and tries to perform the handshake as well.
- RS tries to initialize the replication domain parameters.
- RS waits for connections to the other (non-responding) RSs to be established.
- ...in the meantime, DS's handshake attempt times out.
To be honest, I don't understand why DataServerHandler has to wait for the replication server to actually perform the connection check cycle. The only thing that happens from the point of view of DataServerHandler is that, under the hood, ReplicationServer tries to (re)connect to all other RSs; nothing changes in ReplicationDomain's internal state if the connection fails... nothing happens, the call just takes a bit longer. And as a side effect of waiting for the connection check, the original socket times out.
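To illustrate the mechanics, here is a minimal, self-contained Java sketch of the failure mode using plain sockets (this is not Wren:DS code, just the generic pattern): the "RS" side blocks on a simulated peer connection attempt before replying, so the "DS" side's read times out first.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class HandshakeTimeoutDemo {
    // Stands in for the shared socket timeout; in the real issue both sides
    // use the same value, which makes the DS side lose this race in practice.
    static final int TIMEOUT_MS = 2_000;

    public static void main(String[] args) throws Exception {
        try (ServerSocket rs = new ServerSocket(0)) {
            Thread rsSide = new Thread(() -> {
                try (Socket conn = rs.accept()) {
                    // Simulate the RS spending (at least) the shared timeout
                    // trying to reach an unreachable peer RS before replying;
                    // doubled here so the demo's outcome is deterministic.
                    Thread.sleep(2L * TIMEOUT_MS);
                    conn.getOutputStream().write('!'); // handshake reply, too late
                } catch (IOException | InterruptedException ignored) {
                }
            });
            rsSide.start();

            try (Socket ds = new Socket()) {
                ds.connect(new InetSocketAddress("localhost", rs.getLocalPort()));
                ds.setSoTimeout(TIMEOUT_MS); // DS handshake read timeout
                ds.getInputStream().read();  // waits for the handshake reply
            } catch (SocketTimeoutException e) {
                System.out.println("DS side timed out waiting for the handshake reply");
            }
            rsSide.join();
        }
    }
}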
When I drop the requirement for the connection check, everything works as expected. But dropping something that is obviously there for a reason is a pretty big deal :/.
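For completeness, this is roughly the shape of the alternative I experimented with, using purely hypothetical names (not the actual Wren:DS classes): reply to the DS handshake first and push the peer-RS connection attempts off the accept thread, so an unreachable peer can no longer stall the handshake.

import java.util.List;
import java.util.concurrent.CompletableFuture;

class AcceptPathSketch {
    // Hypothetical abstraction over a blocking connection attempt to a peer RS.
    interface PeerConnector {
        void connect(String peer);
    }

    static void onDataServerAccepted(Runnable answerHandshake,
                                     List<String> peerReplicationServers,
                                     PeerConnector connector) {
        // Answer the DS handshake immediately instead of blocking on peers.
        answerHandshake.run();
        // Then try to (re)connect to the other RSs in the background; a failed
        // attempt changes nothing in the domain state, as observed above.
        for (String peer : peerReplicationServers) {
            CompletableFuture.runAsync(() -> connector.connect(peer));
        }
    }
}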
Digging through commit history...
Based on 51ef33b, it seems that the wait was always mandatory when creating a new ReplicationDomain in the "old days". That commit made the wait mandatory only for DS connections.
The commit message for this change, 4d90aff, could be the title of this issue:
Fix issue OpenDJ-96: Replication server monitor data computation takes too long / blocks rest of server when another RS is cannot be reached
Still, I am none the wiser as to why the wait is there.
Ok, even though the timeout is very strangely implemented, the server is behaving as intended. With isolation-policy set to accept-all-updates, the server starts with writes enabled.
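For reference, switching the policy with dsconfig should look roughly like this; I haven't double-checked the exact invocation, and the domain name may differ per deployment (dsconfig list-replication-domains shows it):
docker exec -it wrends-test1 \
dsconfig set-replication-domain-prop \
--provider-name "Multimaster Synchronization" \
--domain-name "dc=example,dc=com" \
--set isolation-policy:accept-all-updates \
--hostname localhost --port 4444 \
--bindDN "cn=Directory Manager" --bindPassword password \
--trustAll --no-prompt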