Throwing an error during a transaction doesn't allow the lock to roll back

Question

Throwing an error during a transaction doesn't allow the lock to roll back

OultimoCoder opened this issue a year ago · 26 comments

I am using the latest version.

Here is my code:

export const createOauthUser = async (
  providerUser: OauthUser,
  databaseConfig: Config['database']
) => {
  const db = getDBClient(databaseConfig)
  try {
    await db.transaction().execute(async (trx) => {
      const userId = await trx
        .insertInto('user')
        .values({
          name: providerUser.name,
          email: providerUser.email,
          is_email_verified: true,
          password: null,
          role: 'user'
        })
        .executeTakeFirstOrThrow()
      console.log(userId)
      await trx
        .insertInto('authorisations')
        .values({
          user_id: Number(userId.insertId),
          provider_type: providerUser.providerType,
          provider_user_id: providerUser.id.toString()
        })
        .executeTakeFirstOrThrow()
      console.log('ended')
      return userId
    })
  } catch (error) {
    try {
      await db.deleteFrom('user').where('user.email', '=', providerUser.email).execute()
    } catch (err) {
      console.log(err)
    }
    console.log('here')
    throw new ApiError(
      httpStatus.FORBIDDEN,
      `Cannot signup with ${providerUser.providerType}, user already exists with that email`
    )
  }
  const user = await getUserByProviderIdType(
    providerUser.id.toString(), providerUser.providerType, databaseConfig
  )
  return User.convert(user)
}

I added the delete statement in the try catch purely to handle an error I am having. If I throw an error during my transaction it should rollback the transaction and release the locks but I instead get this error intermittently (the other times it completes in under 20 seconds):

DatabaseError: target: dictionary-api.-.primary: vttablet: rpc error: code = DeadlineExceeded desc = Lock wait timeout exceeded; try restarting transaction (errno 1205) (sqlstate HY000) (CallerID: ihkeoohzfw2oodprmc73): Sql: "delete from `user` where `user`.email = :user_email", BindVars: {REDACTED}
        at Connection.execute (/backend/node_modules/@planetscale/database/dist/index.js:78:19)
        at processTicksAndRejections (node:internal/process/task_queues:95:5)
        at _PlanetScaleConnection.executeQuery (/backend/node_modules/kysely-planetscale/dist/index.js:92:21)
        at /backend/node_modules/kysely/dist/esm/query-executor/query-executor-base.js:35:28
        at DefaultConnectionProvider.provideConnection (/backend/node_modules/kysely/dist/esm/driver/default-connection-provider.js:10:20)
        at DefaultQueryExecutor.executeQuery (/backend/node_modules/kysely/dist/esm/query-executor/query-executor-base.js:34:16)
        at DeleteQueryBuilder.execute (/backend/node_modules/kysely/dist/esm/query-builder/delete-query-builder.js:392:24)
        at Module.createOauthUser (/backend/src/services/user.service.ts:65:7)
        at Module.loginOrCreateUserWithOauth (/backend/src/services/auth.service.ts:93:19)
        at oauthCallback (/backend/src/controllers/auth/oauth/oauth.controller.ts:27:16) {
      status: 400,
      body: {
        message: 'target: dictionary-api.-.primary: vttablet: rpc error: code = DeadlineExceeded desc = Lock wait timeout exceeded; try restarting transaction (errno 1205) (sqlstate HY000) (CallerID: ihkeoohzfw2oodprmc73): Sql: "delete from `user` where `user`.email = :user_email", BindVars: {REDACTED}',
        code: 'UNKNOWN'
      }
    }

This is the sole test I am running and the sole code executing.

Answer 1 · 2023-06-25T16:47:34.000Z

I found this upgrading from 1.1 to 1.3

Answer 2 · 2023-06-26T09:05:40.000Z

@OultimoCoder this used to work for you with version 1.1? I'm not seeing anything in the diff that would directly change the behavior of transactions: v1.1.0...v1.3.0.

Did you upgrade anything else at the same time (kysely or @planetscale/database)? It still could be something here, but if not, the other thing to check would be if another request was holding a lock that conflicted with the DELETE, I'm not super familiar with what Vitess locks on delete.

Answer 3 · 2023-06-26T15:30:15.000Z

Yeah 1.2 works fine but going to 1.3 breaks it, nothing else changes. I tested just now only incrementing this package version until it broke.

Answer 4 · 2023-06-26T15:31:46.000Z

Interesting - are you setting the useSharedConnection config option? That's the thing that 1.3 added, but it's supposed to be disabled by default.

Answer 5 · 2023-06-26T15:38:18.000Z

I haven't made any changes from 1.1, just upgraded the package to 1.3.

Answer 6 · 2023-06-26T15:42:43.000Z

Could it be related to the changes made that fixed this other issue I reported with transactions in 1.3 #17

Answer 7 · 2023-06-26T15:43:52.000Z

Well v1.2.0 should have that fix included, so if you're not able to reproduce on 1.2, then it seems unrelated to that.

Answer 8 · 2023-06-26T15:44:15.000Z

import { Kysely } from 'kysely'
import { PlanetScaleDialect } from 'kysely-planetscale'
import { ApiKeyTable } from '../models/apiKey.model'
import { AuthProviderTable } from '../models/authProvider.model'
import { EntriesTable } from '../models/entry.model'
import { FieldSourcesTable } from '../models/fieldSource.model'
import { SentencesTable } from '../models/sentence.model'
import { SourcesTable } from '../models/source.model'
import { UserTable } from '../models/user.model'
import { Config } from './config'

let dbClient: Kysely<Database>

export interface Database {
  user: UserTable,
  authorisations: AuthProviderTable
  api_keys: ApiKeyTable,
  sentences: SentencesTable,
  entries: EntriesTable,
  sources: SourcesTable,
  field_sources: FieldSourcesTable
}

export const getDBClient = (databaseConfig: Config['database']) => {
  dbClient =
    dbClient ||
    new Kysely<Database>({
      dialect: new PlanetScaleDialect({
        username: databaseConfig.username,
        password: databaseConfig.password,
        host: databaseConfig.host
      })
    })
  return dbClient
}

This is how my client is instantiated, I have done 0 code changes when testing upgrades.

Answer 9 · 2023-06-26T15:47:06.000Z

And I did a repeat test just to check, 100% something introduced between 1.2 and 1.3 which causes this error.

Answer 10 · 2023-06-26T15:48:03.000Z

Oh actually 1.2.1 included the transaction fixes from #17, 1.2.0 did not. Did you try it with 1.2.1, or just 1.2.0?

Answer 11 · 2023-06-26T15:50:44.000Z

Fails with 1.2.1!

Answer 12 · 2023-06-26T15:52:27.000Z

I actually had another project that was using this package and I created a support ticket on planetscale as two queries at the same time were exceeding the transaction deadline of 20 seconds and they were doing like 1 read and 2 updates in the transaction and averaged a second total and I was very confused.

I would bet its the exact same issue as I was running version 1.3.

Answer 13 · 2023-06-26T15:57:06.000Z

Yeah this is interesting, it's almost as if the rollback is not called? 1.2.1 updated the logic to await the rollback (and everything else), but I'm not sure how that would cause the issue you're seeing.

@koskimas do you see anything I'm doing incorrectly here? v1.2.0...v1.2.1#diff-a2a171449d. I'm not super familiar with what Kysely internals do when an error is thrown inside a transaction.

Answer 14 · 2023-07-02T14:15:20.000Z

@OultimoCoder @jacobwgillespie I was able to replicate this issue but only when I put delete statement into catch block within the transaction scope, but not like it's in the original code.

Now haven't find the root cause for this 20 sec query execution time so far, but there is something what could help @OultimoCoder to mitigate this problem.

Basically, I found that v1.2.0 doesn't perform rollbacks in case of an error within the transaction - I guess that's why in the given code there is a delete statement in case of the error.
Starting with v1.2.1 this seems to be fixed, and rollbacks are performed - inserts aren't committed.

So with rollbacks truly supported now, @OultimoCoder you should be able to get rid of delete statement in catch block and have rollback do it for you.

Answer 15 · 2023-07-02T14:26:52.000Z

I originally found this when running tests using jest and a delete test would run after a test that intentionally errored in the catch block.

The delete code in the catch block is just an easy way of replicating it. I only put it there when I want to replicate the issue.

My issue is the 20 second timeout with the locks not being released after an error in the catch block.

Answer 16 · 2023-07-10T08:52:53.000Z

Any update on this? :(

Answer 17 · 2023-07-10T11:25:53.000Z

I haven't had a chance to investigate further yet - the thing that was fixed in v1.2.1 (#10) was that transaction commands (BEGIN, COMMIT, ROLLBACK) weren't actually being awaited, so they could fail silently.

What I don't know here is if there's an issue inside kysely-planetscale or @planetscale/database that's preventing the normal transaction semantics from working, or if there's a legitimate lock contention in PlanetScale / Vitess that is only now showing up for you because previously it would have silently failed.

Answer 18 · 2023-07-10T11:28:48.000Z

Does the demo code not prove there is no legitimate lock contention?

A failed transaction followed by a delete should always work, no?

I might have an investigate into this too then see if I can find out anymore.

Answer 19 · 2023-07-10T11:29:51.000Z

A failed transaction followed by a delete should always work, no?

Possibly - how much traffic is this app getting? If you're getting this with a single request, then I'd expect it to work, but if there's concurrent requests / queries then it gets more complicated.

Answer 20 · 2023-07-10T11:31:22.000Z

Tested locally. Repeatable and reproducible with a single request or integration test. 0 concurrency. About 75% of the time it times out, other 25% it completes in about 18-19 seconds.

In a separate app, I had the same issue in prod if there were 2 requests at the same time and one transaction failed.

The demo code was just added (I never normally delete in the try catch) to make it reproducible with one request.

Answer 21 · 2023-07-10T11:34:24.000Z

Ill write a reproducible script for this.

Answer 22 · 2023-07-10T13:28:36.000Z

@jacobwgillespie hey!

I have a repo here which reproduces the issue pretty consistently. Very minimal code with a quick readme on how to run it (you just need a fresh planetscale db/branch).

https://github.com/OultimoCoder/kysely-planetscale-issue-20

Answer 23 · 2023-07-11T08:15:40.000Z

@OultimoCoder thank you for that reproduction, that's super helpful - I ran it with quite a bit of logging inside the kysely-planetscale code, and it appeared to be issuing the expected queries, it correctly created a separate connection for the transaction, etc.

I adapted it to use @planetscale/database directly, with raw SQL queries, and it's still timing out due to a lock. You can see the updated code here: https://github.com/jacobwgillespie/kysely-planetscale-issue-20.

So, I think this is either an issue inside @planetscale/database, or some misunderstanding of how locks are working with PlanetScale. I would assume the former, since it feels like a single transaction, with ROLLBACK shouldn't affect anything like you said.

Answer 24 · 2023-07-11T08:27:47.000Z

Thanks so much for looking into this! Should I raise this issue on the @planetscale/database repo and point to both of our codebases?

Answer 25 · 2023-07-11T08:28:39.000Z

I think that makes sense, yeah, they're pretty responsive over there 👍

Answer 26 · 2023-07-29T11:27:59.000Z

This was fixed in @planetscale/database v1.10.0 🎉

https://github.com/planetscale/database-js/releases/tag/v1.10.0