orientechnologies/orientdb

Timeout during property creation on a distributed OrientDB configuration

Opened this issue · 4 comments

Due to this issue, we cannot launch the production server successfully. The migration stalls and the application fails to initialize correctly in the distributed environment.

We are encountering a problem when starting the application against a distributed database configuration with 5 nodes. The issue occurs specifically when creating a property on a vertex class that has a large number of records in the database. During this operation, we see the following error:

Caused by: com.orientechnologies.orient.server.distributed.task.ODistributedOperationException: Quorum 5 not reached for request (id=2.15877 task=sql_command_ddl_second_phase). Elapsed=24740ms. No server in conflict. Received: 
- DB-node-A: waiting-for-response
- DB-node-B: waiting-for-response
- DB-node-C: waiting-for-response
- DB-node-D: waiting-for-response
- DB-node-E: waiting-for-response	DB name="db"	DB name="db"
	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
	at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.handleException(OChannelBinaryAsynchClient.java:355)
	at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.handleStatus(OChannelBinaryAsynchClient.java:303)
	at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.handleStatus(OChannelBinaryAsynchClient.java:325)
	at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:209)
	at com.orientechnologies.orient.client.binary.OChannelBinaryAsynchClient.beginResponse(OChannelBinaryAsynchClient.java:167)
	at com.orientechnologies.orient.client.remote.OStorageRemote.beginResponse(OStorageRemote.java:2003)
	at com.orientechnologies.orient.client.remote.OStorageRemote.lambda$networkOperationRetryTimeout$2(OStorageRemote.java:435)
	at com.orientechnologies.orient.client.remote.OStorageRemote.baseNetworkOperation(OStorageRemote.java:500)
	at com.orientechnologies.orient.client.remote.OStorageRemote.networkOperationRetryTimeout(OStorageRemote.java:415)
	at com.orientechnologies.orient.client.remote.OStorageRemote.networkOperationNoRetry(OStorageRemote.java:450)
	at com.orientechnologies.orient.client.remote.OStorageRemote.command(OStorageRemote.java:1169)
	at com.orientechnologies.orient.client.remote.db.document.ODatabaseDocumentRemote.command(ODatabaseDocumentRemote.java:430)
	at com.orientechnologies.orient.client.remote.metadata.schema.OClassRemote.addProperty(OClassRemote.java:83)
	at com.orientechnologies.orient.core.metadata.schema.OClassImpl.createProperty(OClassImpl.java:417)
	at com.orientechnologies.orient.core.metadata.schema.OClassAbstractDelegate.createProperty(OClassAbstractDelegate.java:166)
	at com.tinkerpop.blueprints.impls.orient.OrientElementType.access$201(OrientElementType.java:34)
	at com.tinkerpop.blueprints.impls.orient.OrientElementType$3.call(OrientElementType.java:94)
	at com.tinkerpop.blueprints.impls.orient.OrientElementType$3.call(OrientElementType.java:91)
	at com.tinkerpop.blueprints.impls.orient.OrientBaseGraph.executeOutsideTx(OrientBaseGraph.java:1849)
	at com.tinkerpop.blueprints.impls.orient.OrientElementType.createProperty(OrientElementType.java:90)
	at com.tinkerpop.blueprints.impls.orient.OrientVertexType.createProperty(OrientVertexType.java:133)
	at com.tinkerpop.blueprints.impls.orient.OrientVertexType.createProperty(OrientVertexType.java:32)

We tried changing the database configuration, but it didn't help. Here's the configuration we used:

final long distributedResponsesTimeout = 60000L;
final var orientDBConfig = OrientDBConfig.builder()
        .addConfig(DISTRIBUTED_ASYNCH_RESPONSES_TIMEOUT, distributedResponsesTimeout)
        .addConfig(NETWORK_SOCKET_TIMEOUT, distributedResponsesTimeout)
        .addConfig(NETWORK_LOCK_TIMEOUT, distributedResponsesTimeout)
        .addConfig(DISTRIBUTED_PURGE_RESPONSES_TIMER_DELAY, distributedResponsesTimeout)
        .addConfig(DISTRIBUTED_AUTO_REMOVE_OFFLINE_SERVERS, distributedResponsesTimeout)
        .addConfig(DISTRIBUTED_CHECK_HEALTH_CAN_OFFLINE_SERVER, true)
        .addConfig(DISTRIBUTED_CRUD_TASK_SYNCH_TIMEOUT, distributedResponsesTimeout)
        .addConfig(DISTRIBUTED_COMMAND_TASK_SYNCH_TIMEOUT, distributedResponsesTimeout)
        .addConfig(DISTRIBUTED_MAX_STARTUP_DELAY, distributedResponsesTimeout)
        .addConfig(DISTRIBUTED_HEARTBEAT_TIMEOUT, distributedResponsesTimeout)
        .addConfig(DISTRIBUTED_CHECK_HEALTH_EVERY, 5000L)
        .build();
instance.orientDB = new OrientDB(
        databaseUrl,
        databaseSettings.databaseUser,
        databaseSettings.databasePassword,
        orientDBConfig
);

Additionally, in the logs of the master node, we see the following warning:

2024-09-25 09:26:18:430 WARNI {db=db} [DB-node-B] Timeout (24465ms) on waiting for synchronous responses from nodes=[DB-node-A, DB-node-B, DB-node-C, DB-node-D, DB-node-E] responsesSoFar=[DB-node-D] request=(id=1.5390 task=sql_command_ddl_second_phase) [OHazelcastPlugin]

Hi,

I think this is caused by the data migration that runs while the property is being created. You can skip the check and the migration of the property by using the unsafe option. If you are not sure that the property already holds values of the right type, you can run the data migration yourself beforehand; this is what OrientDB does internally.

Here is an example of an unsafe create property:

create property MyVertex.name String unsafe

The corresponding code that does the data migration, taken from the OrientDB code:

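// Select only the records whose "name" value is not already a STRING.
try (OResultSet result =
    database.query("select from MyVertex where name.type() <> 'STRING' ")) {
  while (result.hasNext()) {
    ODocument record = (ODocument) result.next().getElement().get();
    // Re-set the field with an explicit type so the value is converted, then persist it.
    record.field("name", record.field("name"), OType.STRING);
    database.save(record);
  }
}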

You can of course adapt this as needed.

We will check on our side as well. These data migrations need to be done by only one node, and I think that as of today they are re-executed on all the nodes in parallel, which can cause issues.
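For anyone who wants to issue the unsafe property creation from Java rather than from a console, here is a minimal sketch. It only builds the statement and hands it to a callback. The class `UnsafePropertySketch`, its methods, and the `Consumer<String>` runner are hypothetical names introduced for illustration; in a real application the runner would presumably wrap something like `session.command(sql).close()` on an `ODatabaseSession`, which is an assumption and is not verified against your OrientDB version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch only: the class and method names are hypothetical.
final class UnsafePropertySketch {

    // Builds "CREATE PROPERTY <class>.<property> <type> UNSAFE".
    // Only plain identifiers are accepted, so arbitrary SQL cannot be spliced in.
    static String createPropertyUnsafe(String className, String property, String type) {
        for (String part : new String[] {className, property, type}) {
            if (!part.matches("[A-Za-z_][A-Za-z0-9_]*")) {
                throw new IllegalArgumentException("Unexpected identifier: " + part);
            }
        }
        return "CREATE PROPERTY " + className + "." + property + " " + type + " UNSAFE";
    }

    // sqlRunner stands in for whatever executes SQL against the database
    // (assumption: e.g. sql -> session.command(sql).close()).
    static void run(Consumer<String> sqlRunner, String className, String property, String type) {
        sqlRunner.accept(createPropertyUnsafe(className, property, type));
    }

    public static void main(String[] args) {
        // A list collects the statement here instead of sending it to a server.
        List<String> executed = new ArrayList<>();
        run(executed::add, "MyVertex", "name", "STRING");
        System.out.println(executed.get(0)); // prints "CREATE PROPERTY MyVertex.name STRING UNSAFE"
    }
}
```

Because the unsafe option skips the check of existing records, the migration loop shown above would normally be run first if there is any doubt about the stored values.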

@tglman
Currently, we are using the TinkerPop Blueprints API to create properties: com.tinkerpop.blueprints.impls.orient.OrientElementType#createProperty(java.lang.String, com.orientechnologies.orient.core.metadata.schema.OType)

When we do this via ODatabaseSession.query(), we get the following error:
Caused by: com.orientechnologies.orient.core.exception.OSchemaException: Cannot create property 'property' inside a transaction

Hi,

As of today, DDL cannot run while a transaction is active. You can make sure that no transaction is active by calling the commit or rollback methods. The Blueprints APIs are still supported in 3.2.x, but they are deprecated and will be removed in the next major version, so for long-term support I would suggest using Gremlin (the orientdb-gremlin dependency and the tp3 distribution) or the native OrientDB APIs.
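The ordering described above (end any open transaction, then run the DDL) can be sketched as follows. The `Session` interface and `FakeSession` class are hypothetical stand-ins so the example runs without a server; on a real `ODatabaseSession` the corresponding calls would presumably be `getTransaction().isActive()`, `commit()` or `rollback()`, and `command(...)`, which is an assumption to check against your version.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: Session and FakeSession are hypothetical stand-ins.
final class DdlOutsideTxSketch {

    // The small slice of a database session that this sketch relies on.
    interface Session {
        boolean isTransactionActive();
        void commit();
        void command(String sql);
    }

    // Commits pending work first (rollback would be the alternative),
    // then runs the DDL with no transaction open.
    static void runDdl(Session session, String ddl) {
        if (session.isTransactionActive()) {
            session.commit();
        }
        session.command(ddl);
    }

    // In-memory fake that rejects DDL inside a transaction, mimicking the reported error.
    static final class FakeSession implements Session {
        final List<String> log = new ArrayList<>();
        private boolean active;

        FakeSession(boolean active) {
            this.active = active;
        }

        public boolean isTransactionActive() {
            return active;
        }

        public void commit() {
            active = false;
            log.add("commit");
        }

        public void command(String sql) {
            if (active) {
                throw new IllegalStateException("Cannot run DDL inside a transaction");
            }
            log.add(sql);
        }
    }

    public static void main(String[] args) {
        FakeSession session = new FakeSession(true);
        runDdl(session, "CREATE PROPERTY MyVertex.name STRING UNSAFE");
        System.out.println(session.log); // prints "[commit, CREATE PROPERTY MyVertex.name STRING UNSAFE]"
    }
}
```

Whether to commit or roll back the open transaction depends on whether the pending changes should be kept; the sketch commits only to keep the example short.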

Bye