neo4j/neo4j-javascript-driver

Concurrent queries blocking release for connections

miguel-angel-castro-cortes opened this issue · 12 comments

I'm getting "connection reset by peer" errors.

Neo4j Version: 4.2.2 Community
Neo4j Mode: Single instance
Driver version: JS driver 4.2.3
Operating System: Ubuntu 20

Steps to reproduce

  1. Start Neo4j on Neo4j Desktop
  2. Use the logger with debug level in config object of the driver.
  3. Send multiple concurrent requests, for example 5 iterations of 100 requests at the same time (see the sketch after this list).
  4. Wait for them to finish.
  5. See in the logs that some of the connections don't release.
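
For reference, here is a minimal sketch of steps 2 to 4 run directly against the driver (the credentials and the RETURN 1 query are placeholders, not from the original report):

import neo4j from 'neo4j-driver';

// Placeholders: adjust the URL and credentials for your local instance.
const driver = neo4j.driver(
  'bolt://localhost:7687',
  neo4j.auth.basic('neo4j', 'password'),
  { logging: neo4j.logging.console('debug') } // step 2: debug-level logger
);

async function fireBatch(size: number): Promise<void> {
  // Each "request" gets its own session, runs one query, and closes it.
  await Promise.all(
    Array.from({ length: size }, async () => {
      const session = driver.session();
      try {
        await session.run('RETURN 1');
      } finally {
        await session.close();
      }
    })
  );
}

async function main(): Promise<void> {
  for (let i = 0; i < 5; i++) {
    await fireBatch(100); // step 3: 5 iterations of 100 concurrent requests
  }
  await driver.close(); // step 5: check the debug logs for unreleased connections
}

main();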

Expected behavior

When concurrent requests (each of which runs a query against the Neo4j database) are made and finish, their connections should be released back to the pool.

Actual behavior

When there are multiple concurrent requests, some connections are not released after the requests finish.

Notes

Eventually when more and more connections are "blocked" (not released), the following error will pop up:
Neo4jError: Connection acquisition timed out in 60000 ms. Poos status: Active conn count = 100, Idle conn count = 0.

Hi @miguel-angel-castro-cortes,

Thanks for the report.

However, the connection is only released when await session.close() has been called.

This behaviour is described in detail in this section of the driver manual.

Could you verify whether the issue is solved by closing the session?
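
For completeness, a minimal sketch of that lifecycle, assuming a Driver instance created elsewhere (the query is just an example):

import { Driver } from 'neo4j-driver';

// The connection backing a session is returned to the pool only once
// await session.close() resolves; forgetting the close leaks the connection.
async function countNodes(driver: Driver): Promise<number> {
  const session = driver.session();
  try {
    const result = await session.run('MATCH (n) RETURN count(n) AS c');
    return result.records[0].get('c').toNumber();
  } finally {
    await session.close(); // releases the connection back to the pool
  }
}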

Thanks for the reply, but I'm already closing the sessions. Here is a small piece of code showing how I do it at the end of each transaction:

try {
  await txc.commit();
} catch (error) {
  await txc.rollback();
  throw new Error(errorMessage + `, error: ${error}`);
} finally {
  await session.close();
}

The code snippet seems right.

Could you provide some small piece of code which I can run and reproduce the error?

Okay, I made a small project to reproduce the bug: https://codesandbox.io/s/vigorous-chebyshev-78ok4?file=/src/app.module.ts
Things to take into account:

  • You will have to run this locally unless you know how to run Neo4j in CodeSandbox (which I don't); running it locally also makes it easier to send the concurrent requests.
  • I set a small maxConnectionPoolSize in the driver config, but that is not necessary; you can use the default (100).
  • I assume the following config for your Neo4j instance:

    {
      scheme: 'bolt',
      host: 'localhost',
      password: 'whateverPasswordYouWant',
      port: '7687',
      username: 'neo4j',
      database: 'neo4j',
    }
  • Then you can use Postman or whatever method you want to send multiple concurrent requests; in this example you may want to do a GET to the route /. Be sure not to send only sequential requests! (See the sketch after this list.)
  • The most important thing is to watch the logs for lines like debug Connection [x][.... As far as I can tell, the idea is this: when one person makes requests to the database, they take a connection, for example Connection [1], and run all their queries on it. If another person makes requests at the same time, they use Connection [2], and so on. Once the first person stops making requests, their connection should become idle after some time, so a new person (or the same one) making requests could reuse Connection [1]. But for some reason the connection never becomes idle or released (even though the debug logs say it is 'released to the pool'). A person making new requests then takes a fresh connection, in this example Connection [4], instead of reusing the Connection [1] that should have been free. Eventually new requests use up all the connections until this error pops up: Neo4jError: Connection acquisition timed out in 60000 ms
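
For anyone trying this, a minimal sketch of the concurrent GETs mentioned above (the base URL assumes Nest's default port 3000; adjust as needed):

import http from 'http';

function get(url: string): Promise<number> {
  return new Promise((resolve, reject) => {
    http
      .get(url, (res) => {
        res.resume(); // drain the body so the socket is freed
        res.on('end', () => resolve(res.statusCode ?? 0));
      })
      .on('error', reject);
  });
}

async function main(): Promise<void> {
  // Two truly concurrent requests are already enough to observe the
  // behaviour described above; raise the count to stress the pool harder.
  const statuses = await Promise.all(
    Array.from({ length: 10 }, () => get('http://localhost:3000/'))
  );
  console.log(statuses);
}

main();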

Hi @miguel-angel-castro-cortes, sorry for the delayed response.

About the algorithm for getting connections from the pool: it always takes the last released connection. So if the last released one was Connection [4], that is the one acquired for the next query. I've tried your code locally against a local database (the movies database), running more than 800 requests at the same time, and saw no issue at all. It could be that the combination of query duration and the number of parallel requests is exhausting the connection pool.

PS: this applies to connections to the same server; in cluster environments a previous step is done to select the server and then get the connection.
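
In other words (a toy illustration only, not the driver's actual implementation), the idle list behaves like a stack:

// Toy LIFO pool: the most recently released resource is acquired next.
class ToyPool<T> {
  private idle: T[] = [];

  release(conn: T): void {
    this.idle.push(conn); // released connections go on top of the stack
  }

  acquire(): T | undefined {
    return this.idle.pop(); // ...and the top one is handed out next
  }
}

const pool = new ToyPool<string>();
pool.release('Connection [1]');
pool.release('Connection [4]'); // released last...
console.log(pool.acquire());    // ...so 'Connection [4]' is acquired first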

Thanks for the response!
Then I must be missing something, because as far as I can tell some of the connections that are used are never released, which eventually leads to this error: Poos status: Active conn count = 100, Idle conn count = 0. And it is not because of a huge load. If, for example, I make 2 concurrent requests which use Connection [0] and Connection [1], sometimes only Connection [1] gets released. If, say, 15 minutes later I make another 2 concurrent requests, they use Connection [1] (which was released) and then continue with Connection [2], skipping Connection [0] (which, as far as I can tell, never got released). This keeps adding up until there are no idle connections left, even with only a few requests spread over time.

Connection [0] can never be used again because it got closed for some reason (TTL, disconnection, network issues, and so on). The pool will then spawn a new connection with a new id to replace it, while still respecting the size of the pool.

On Monday I will try to create a long-running scenario and leave it running on my dev machine to monitor this issue.

PS: I also spotted a missing await here: this.neo4jService.transactionCommit(transaction, session, 'error');. But it should not be the cause.
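
For clarity, the fix for that spot is just to await the call (the method name is quoted from the sandbox project):

// Without the await, the commit runs fire-and-forget: errors from it are
// swallowed and the surrounding code moves on before the commit finishes.
await this.neo4jService.transactionCommit(transaction, session, 'error');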

Okay, awesome. I think I get it now: previous connections are discarded and new ones are created to replace them, so even if the max pool size is, for example, 5, I can see Connection [6] or greater. But then the problem is that for some reason connections are not being recognized as idle, which is weird, because when you commit the transaction and close the session the connection should not remain active, I suppose.

Hi @miguel-angel-castro-cortes,

I've kept the app running continuously with 1024 requests in parallel for some hours and nothing abnormal happened.

Could you do some measurements? I need the number of requests per second and the request durations (average, median, max, min, etc.). You are probably reaching the limit of your current pool configuration and node processes.
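
One possible way to collect those numbers, as a sketch (assumes Node 18+ for the global fetch; the target URL is a guess):

// Time each request and report throughput plus min / max / average / median.
async function timedGet(url: string): Promise<number> {
  const start = Date.now();
  const res = await fetch(url);
  await res.text(); // drain the body so the timing covers the full response
  return Date.now() - start;
}

async function measure(n: number): Promise<void> {
  const wallStart = Date.now();
  const durations = await Promise.all(
    Array.from({ length: n }, () => timedGet('http://localhost:3000/'))
  );
  const wallMs = Date.now() - wallStart;
  durations.sort((a, b) => a - b);
  console.log({
    requestsPerSecond: (n / wallMs) * 1000,
    minMs: durations[0],
    maxMs: durations[n - 1],
    avgMs: durations.reduce((a, b) => a + b, 0) / n,
    medianMs: durations[Math.floor(n / 2)],
  });
}

measure(100);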

I see. Well, it happens most of the time with even two concurrent requests, but since you haven't been able to reproduce the bug (most likely because the small project I made is not enough), give me some time to see if I can produce a code snippet that always reproduces it. That might be a little challenging because the original project the bug comes from is a lot more complex and bigger, but in any case I will keep you updated.

Hi, I've been pretty much out of time, so I haven't been able to build the perfect code snippet, but I figured that for now I can post my logs to demonstrate the issue; maybe they are helpful:

[1:06:55 PM] Starting compilation in watch mode...

[1:06:59 PM] Found 0 errors. Watching for file changes.

[Nest] 29573   - 07/12/2021, 1:07:00 PM   [NestFactory] Starting Nest application...
info Direct driver 0 created for server address localhost:7687
debug Connection [0][] created towards localhost:7687
[Nest] 29573   - 07/12/2021, 1:07:00 PM   [InstanceLoader] SentryModule dependencies initialized +82ms
[Nest] 29573   - 07/12/2021, 1:07:00 PM   [InstanceLoader] PubSubModule dependencies initialized +2ms
[Nest] 29573   - 07/12/2021, 1:07:00 PM   [InstanceLoader] HttpModule dependencies initialized +2ms
[Nest] 29573   - 07/12/2021, 1:07:00 PM   [InstanceLoader] SentryCoreModule dependencies initialized +1ms
[Nest] 29573   - 07/12/2021, 1:07:00 PM   [InstanceLoader] ConfigHostModule dependencies initialized +1ms
[Nest] 29573   - 07/12/2021, 1:07:00 PM   [InstanceLoader] DiscoveryModule dependencies initialized +1ms
[Nest] 29573   - 07/12/2021, 1:07:00 PM   [InstanceLoader] CacheModule dependencies initialized +2ms
[Nest] 29573   - 07/12/2021, 1:07:00 PM   [InstanceLoader] FeedModule dependencies initialized +1ms
[Nest] 29573   - 07/12/2021, 1:07:00 PM   [InstanceLoader] TerminusModule dependencies initialized +1ms
[Nest] 29573   - 07/12/2021, 1:07:00 PM   [InstanceLoader] AppModule dependencies initialized +2ms
[Nest] 29573   - 07/12/2021, 1:07:00 PM   [InstanceLoader] ConfigModule dependencies initialized +1ms
[Nest] 29573   - 07/12/2021, 1:07:00 PM   [InstanceLoader] ConfigModule dependencies initialized +1ms
[Nest] 29573   - 07/12/2021, 1:07:00 PM   [InstanceLoader] ScheduleModule dependencies initialized +1ms
[Nest] 29573   - 07/12/2021, 1:07:00 PM   [InstanceLoader] CacheModule dependencies initialized +2ms
[Nest] 29573   - 07/12/2021, 1:07:00 PM   [InstanceLoader] UsersModule dependencies initialized +1ms
[Nest] 29573   - 07/12/2021, 1:07:00 PM   [InstanceLoader] GraphQLSchemaBuilderModule dependencies initialized +1ms
[Nest] 29573   - 07/12/2021, 1:07:00 PM   [InstanceLoader] MetadataModule dependencies initialized +1ms
[Nest] 29573   - 07/12/2021, 1:07:00 PM   [InstanceLoader] GraphQLModule dependencies initialized +1ms
debug Connection [0][] negotiated protocol version 4.2
debug Connection [0][] C: HELLO {user_agent: 'neo4j-javascript/4.2.3', ...}
debug Connection [0][] S: SUCCESS {"signature":112,"fields":[{"server":"Neo4j/4.2.2","connection_id":"bolt-265"}]}
debug Connection [0][bolt-265] created for the pool localhost:7687
debug Connection [0][bolt-265] C: RESET
debug Connection [0][bolt-265] S: SUCCESS {"signature":112,"fields":[{}]}
debug Connection [0][bolt-265] released to the pool localhost:7687
[Nest] 29573   - 07/12/2021, 1:07:01 PM   [InstanceLoader] Neo4jModule dependencies initialized +29ms
[Nest] 29573   - 07/12/2021, 1:07:01 PM   [InstanceLoader] ExternalReferenceModule dependencies initialized +1ms
[Nest] 29573   - 07/12/2021, 1:07:01 PM   [InstanceLoader] FeedReferenceModule dependencies initialized +1ms
[Nest] 29573   - 07/12/2021, 1:07:01 PM   [InstanceLoader] HealthModule dependencies initialized +1ms
[Nest] 29573   - 07/12/2021, 1:07:01 PM   [InstanceLoader] GraphModule dependencies initialized +0ms
[Nest] 29573   - 07/12/2021, 1:07:01 PM   [InstanceLoader] SourceModule dependencies initialized +1ms
[Nest] 29573   - 07/12/2021, 1:07:01 PM   [InstanceLoader] LinkModule dependencies initialized +0ms
[Nest] 29573   - 07/12/2021, 1:07:01 PM   [InstanceLoader] EntityModule dependencies initialized +1ms
[Nest] 29573   - 07/12/2021, 1:07:01 PM   [RoutesResolver] AppController {}: +6ms
[Nest] 29573   - 07/12/2021, 1:07:01 PM   [RouterExplorer] Mapped {, GET} route +3ms
[Nest] 29573   - 07/12/2021, 1:07:01 PM   [RoutesResolver] HealthController {/health}: +1ms
[Nest] 29573   - 07/12/2021, 1:07:01 PM   [RouterExplorer] Mapped {/health, GET} route +0ms
[Nest] 29573   - 07/12/2021, 1:07:01 PM   [NestApplication] Nest application successfully started +191ms
debug Connection [0][bolt-265] acquired from the pool localhost:7687
debug Connection [0][bolt-265] C: BEGIN {"db":"neo4j"}
debug Connection [0][bolt-265] C: RUN 
  match (node :INIT_GRAPH { graph_id: $graph_id })
  return node
  limit tointeger($limit)
 {"graph_id":"graph_v1","limit":50} {}
debug Connection [0][bolt-265] C: PULL {"n":{"low":1000,"high":0}}
debug Connection [0][bolt-265] C: RUN 
  match (node :INIT_GRAPH { graph_id: $graph_id })
  return node
  limit tointeger($limit)
 {"graph_id":"graph_v1","limit":50} {}
debug Connection [0][bolt-265] C: PULL {"n":{"low":1000,"high":0}}
debug Connection [0][bolt-265] S: SUCCESS {"signature":112,"fields":[{}]}
debug Connection [0][bolt-265] S: SUCCESS {"signature":112,"fields":[{"t_first":0,"fields":["node"],"qid":0}]}
debug Connection [0][bolt-265] S: RECORD {"signature":113,"fields":[[{"identity":0,"labels":["INIT_GRAPH"],"properties":{"graph_id":"graph_v1"}}]]}
debug Connection [0][bolt-265] S: SUCCESS {"signature":112,"fields":[{"type":"r","t_last":0,"db":"neo4j"}]}
debug Connection [0][bolt-265] S: SUCCESS {"signature":112,"fields":[{"t_first":1,"fields":["node"],"qid":1}]}
debug Connection [0][bolt-265] S: RECORD {"signature":113,"fields":[[{"identity":0,"labels":["INIT_GRAPH"],"properties":{"graph_id":"graph_v1"}}]]}
debug Connection [0][bolt-265] S: SUCCESS {"signature":112,"fields":[{"type":"r","t_last":0,"db":"neo4j"}]}
debug Connection [0][bolt-265] C: RUN 
  match (entity :Entity {
    graph_id: $graph_id
  })-[:COMES_FROM_FEED]->(feedReference :FeedReference)
  where $startDate <= entity.time <= $endDate and feedReference.feed_id in $feedIds
  with entity, feedReference
  order by entity.time desc, entity.weight desc
  limit tointeger($limit_entities)
  with collect(entity) as entities, feedReference
  unwind entities as fromEntity
  unwind entities as toEntity
  match
      (fromEntity)-[link {graph_id: $graph_id}]->(toEntity)
  where $startDate <= link.time <= $endDate and link.link_id in feedReference.links_id
  return collect(link) as links, entities
  limit tointeger($limit_links)
 {"graph_id":"graph_v1","limit_entities":50,"limit_links":250,"startDate":"2021-07-12T17:37:08.000Z","endDate":"2021-07-12T18:07:08.000Z","feedIds":[]} {}
debug Connection [0][bolt-265] C: PULL {"n":{"low":1000,"high":0}}
debug Connection [0][bolt-265] C: RUN 
  match (entity :Entity {
    graph_id: $graph_id
  })-[:COMES_FROM_FEED]->(feedReference :FeedReference)
  where $startDate <= entity.time <= $endDate and feedReference.feed_id in $feedIds
  with entity, feedReference
  order by entity.time desc, entity.weight desc
  limit tointeger($limit_entities)
  with collect(entity) as entities, feedReference
  unwind entities as fromEntity
  unwind entities as toEntity
  match
      (fromEntity)-[link {graph_id: $graph_id}]->(toEntity)
  where $startDate <= link.time <= $endDate and link.link_id in feedReference.links_id
  return collect(link) as links, entities
  limit tointeger($limit_links)
 {"graph_id":"graph_v1","limit_entities":50,"limit_links":250,"startDate":"2021-07-12T17:37:08.000Z","endDate":"2021-07-12T18:07:08.000Z","feedIds":[]} {}
debug Connection [0][bolt-265] C: PULL {"n":{"low":1000,"high":0}}
debug Connection [0][bolt-265] S: SUCCESS {"signature":112,"fields":[{"t_first":129,"fields":["links","entities"],"qid":2}]}
debug Connection [0][bolt-265] S: SUCCESS {"signature":112,"fields":[{"type":"r","t_last":2,"db":"neo4j"}]}
debug Connection [0][bolt-265] S: SUCCESS {"signature":112,"fields":[{"t_first":0,"fields":["links","entities"],"qid":3}]}
debug Connection [0][bolt-265] S: SUCCESS {"signature":112,"fields":[{"type":"r","t_last":2,"db":"neo4j"}]}
[Nest] 29573   - 07/12/2021, 1:08:09 PM   [ExceptionsHandler] Connection acquisition timed out in 60000 ms. Poos status: Active conn count = 1, Idle conn count = 0. +68569ms
Neo4jError: Connection acquisition timed out in 60000 ms. Poos status: Active conn count = 1, Idle conn count = 0.

    at captureStacktrace (/home/macastro/Documents/gitlab/sg-backend/node_modules/neo4j-driver/lib/result.js:281:15)
    at new Result (/home/macastro/Documents/gitlab/sg-backend/node_modules/neo4j-driver/lib/result.js:68:19)
    at newCompletedResult (/home/macastro/Documents/gitlab/sg-backend/node_modules/neo4j-driver/lib/transaction.js:455:10)
    at Object.run (/home/macastro/Documents/gitlab/sg-backend/node_modules/neo4j-driver/lib/transaction.js:289:14)
    at Transaction.run (/home/macastro/Documents/gitlab/sg-backend/node_modules/neo4j-driver/lib/transaction.js:125:32)
    at Neo4jService.read (/home/macastro/Documents/gitlab/sg-backend/dist/src/neo4j/neo4j.service.js:64:42)
    at LinkService.fetchLink (/home/macastro/Documents/gitlab/sg-backend/dist/src/graph/link/link.service.js:42:61)
    at LinkService.getLinkModal (/home/macastro/Documents/gitlab/sg-backend/dist/src/graph/link/link.service.js:90:44)
    at LinkResolver.getLinkModal (/home/macastro/Documents/gitlab/sg-backend/dist/src/graph/link/link.resolver.js:24:33)
    at /home/macastro/Documents/gitlab/sg-backend/node_modules/@nestjs/core/helpers/external-context-creator.js:69:33
[Nest] 29573   - 07/12/2021, 1:08:21 PM   [ExceptionsHandler] Connection acquisition timed out in 60000 ms. Poos stat

In the end it turned out to be some async problems with the commit executions for the transactions; fixing those solved the problem. Thanks a lot for the support!