depot/kysely-planetscale

Throwing an error during a transaction doesn't allow the lock to roll back

OultimoCoder opened this issue ยท 26 comments

I am using the latest version.

Here is my code:

export const createOauthUser = async (
  providerUser: OauthUser,
  databaseConfig: Config['database']
) => {
  const db = getDBClient(databaseConfig)
  try {
    await db.transaction().execute(async (trx) => {
      const userId = await trx
        .insertInto('user')
        .values({
          name: providerUser.name,
          email: providerUser.email,
          is_email_verified: true,
          password: null,
          role: 'user'
        })
        .executeTakeFirstOrThrow()
      console.log(userId)
      await trx
        .insertInto('authorisations')
        .values({
          user_id: Number(userId.insertId),
          provider_type: providerUser.providerType,
          provider_user_id: providerUser.id.toString()
        })
        .executeTakeFirstOrThrow()
      console.log('ended')
      return userId
    })
  } catch (error) {
    try {
      await db.deleteFrom('user').where('user.email', '=', providerUser.email).execute()
    } catch (err) {
      console.log(err)
    }
    console.log('here')
    throw new ApiError(
      httpStatus.FORBIDDEN,
      `Cannot signup with ${providerUser.providerType}, user already exists with that email`
    )
  }
  const user = await getUserByProviderIdType(
    providerUser.id.toString(), providerUser.providerType, databaseConfig
  )
  return User.convert(user)
}

I added the delete statement in the try catch purely to handle an error I am having. If I throw an error during my transaction it should rollback the transaction and release the locks but I instead get this error intermittently (the other times it completes in under 20 seconds):

DatabaseError: target: dictionary-api.-.primary: vttablet: rpc error: code = DeadlineExceeded desc = Lock wait timeout exceeded; try restarting transaction (errno 1205) (sqlstate HY000) (CallerID: ihkeoohzfw2oodprmc73): Sql: "delete from `user` where `user`.email = :user_email", BindVars: {REDACTED}
        at Connection.execute (/backend/node_modules/@planetscale/database/dist/index.js:78:19)
        at processTicksAndRejections (node:internal/process/task_queues:95:5)
        at _PlanetScaleConnection.executeQuery (/backend/node_modules/kysely-planetscale/dist/index.js:92:21)
        at /backend/node_modules/kysely/dist/esm/query-executor/query-executor-base.js:35:28
        at DefaultConnectionProvider.provideConnection (/backend/node_modules/kysely/dist/esm/driver/default-connection-provider.js:10:20)
        at DefaultQueryExecutor.executeQuery (/backend/node_modules/kysely/dist/esm/query-executor/query-executor-base.js:34:16)
        at DeleteQueryBuilder.execute (/backend/node_modules/kysely/dist/esm/query-builder/delete-query-builder.js:392:24)
        at Module.createOauthUser (/backend/src/services/user.service.ts:65:7)
        at Module.loginOrCreateUserWithOauth (/backend/src/services/auth.service.ts:93:19)
        at oauthCallback (/backend/src/controllers/auth/oauth/oauth.controller.ts:27:16) {
      status: 400,
      body: {
        message: 'target: dictionary-api.-.primary: vttablet: rpc error: code = DeadlineExceeded desc = Lock wait timeout exceeded; try restarting transaction (errno 1205) (sqlstate HY000) (CallerID: ihkeoohzfw2oodprmc73): Sql: "delete from `user` where `user`.email = :user_email", BindVars: {REDACTED}',
        code: 'UNKNOWN'
      }
    }

This is the sole test I am running and the sole code executing.

I found this upgrading from 1.1 to 1.3

@OultimoCoder this used to work for you with version 1.1? I'm not seeing anything in the diff that would directly change the behavior of transactions: v1.1.0...v1.3.0.

Did you upgrade anything else at the same time (kysely or @planetscale/database)? It still could be something here, but if not, the other thing to check would be if another request was holding a lock that conflicted with the DELETE, I'm not super familiar with what Vitess locks on delete.

Yeah 1.2 works fine but going to 1.3 breaks it, nothing else changes. I tested just now only incrementing this package version until it broke.

Interesting - are you setting the useSharedConnection config option? That's the thing that 1.3 added, but it's supposed to be disabled by default.

I haven't made any changes from 1.1, just upgraded the package to 1.3.

Could it be related to the changes made that fixed this other issue I reported with transactions in 1.3 #17

Well v1.2.0 should have that fix included, so if you're not able to reproduce on 1.2, then it seems unrelated to that.

import { Kysely } from 'kysely'
import { PlanetScaleDialect } from 'kysely-planetscale'
import { ApiKeyTable } from '../models/apiKey.model'
import { AuthProviderTable } from '../models/authProvider.model'
import { EntriesTable } from '../models/entry.model'
import { FieldSourcesTable } from '../models/fieldSource.model'
import { SentencesTable } from '../models/sentence.model'
import { SourcesTable } from '../models/source.model'
import { UserTable } from '../models/user.model'
import { Config } from './config'

let dbClient: Kysely<Database>

export interface Database {
  user: UserTable,
  authorisations: AuthProviderTable
  api_keys: ApiKeyTable,
  sentences: SentencesTable,
  entries: EntriesTable,
  sources: SourcesTable,
  field_sources: FieldSourcesTable
}

export const getDBClient = (databaseConfig: Config['database']) => {
  dbClient =
    dbClient ||
    new Kysely<Database>({
      dialect: new PlanetScaleDialect({
        username: databaseConfig.username,
        password: databaseConfig.password,
        host: databaseConfig.host
      })
    })
  return dbClient
}

This is how my client is instantiated, I have done 0 code changes when testing upgrades.

And I did a repeat test just to check, 100% something introduced between 1.2 and 1.3 which causes this error.

Oh actually 1.2.1 included the transaction fixes from #17, 1.2.0 did not. Did you try it with 1.2.1, or just 1.2.0?

Fails with 1.2.1!

I actually had another project that was using this package and I created a support ticket on planetscale as two queries at the same time were exceeding the transaction deadline of 20 seconds and they were doing like 1 read and 2 updates in the transaction and averaged a second total and I was very confused.

I would bet its the exact same issue as I was running version 1.3.

Yeah this is interesting, it's almost as if the rollback is not called? 1.2.1 updated the logic to await the rollback (and everything else), but I'm not sure how that would cause the issue you're seeing.

@koskimas do you see anything I'm doing incorrectly here? v1.2.0...v1.2.1#diff-a2a171449d. I'm not super familiar with what Kysely internals do when an error is thrown inside a transaction.

sromic commented

@OultimoCoder @jacobwgillespie I was able to replicate this issue but only when I put delete statement into catch block within the transaction scope, but not like it's in the original code.

Now haven't find the root cause for this 20 sec query execution time so far, but there is something what could help @OultimoCoder to mitigate this problem.

Basically, I found that v1.2.0 doesn't perform rollbacks in case of an error within the transaction - I guess that's why in the given code there is a delete statement in case of the error.
Starting with v1.2.1 this seems to be fixed, and rollbacks are performed - inserts aren't committed.

So with rollbacks truly supported now, @OultimoCoder you should be able to get rid of delete statement in catch block and have rollback do it for you.

I originally found this when running tests using jest and a delete test would run after a test that intentionally errored in the catch block.

The delete code in the catch block is just an easy way of replicating it. I only put it there when I want to replicate the issue.

My issue is the 20 second timeout with the locks not being released after an error in the catch block.

Any update on this? :(

I haven't had a chance to investigate further yet - the thing that was fixed in v1.2.1 (#10) was that transaction commands (BEGIN, COMMIT, ROLLBACK) weren't actually being awaited, so they could fail silently.

What I don't know here is if there's an issue inside kysely-planetscale or @planetscale/database that's preventing the normal transaction semantics from working, or if there's a legitimate lock contention in PlanetScale / Vitess that is only now showing up for you because previously it would have silently failed.

Does the demo code not prove there is no legitimate lock contention?

A failed transaction followed by a delete should always work, no?

I might have an investigate into this too then see if I can find out anymore.

A failed transaction followed by a delete should always work, no?

Possibly - how much traffic is this app getting? If you're getting this with a single request, then I'd expect it to work, but if there's concurrent requests / queries then it gets more complicated.

Tested locally. Repeatable and reproducible with a single request or integration test. 0 concurrency. About 75% of the time it times out, other 25% it completes in about 18-19 seconds.

In a separate app, I had the same issue in prod if there were 2 requests at the same time and one transaction failed.

The demo code was just added (I never normally delete in the try catch) to make it reproducible with one request.

Ill write a reproducible script for this.

@jacobwgillespie hey!

I have a repo here which reproduces the issue pretty consistently. Very minimal code with a quick readme on how to run it (you just need a fresh planetscale db/branch).

https://github.com/OultimoCoder/kysely-planetscale-issue-20

@OultimoCoder thank you for that reproduction, that's super helpful - I ran it with quite a bit of logging inside the kysely-planetscale code, and it appeared to be issuing the expected queries, it correctly created a separate connection for the transaction, etc.

I adapted it to use @planetscale/database directly, with raw SQL queries, and it's still timing out due to a lock. You can see the updated code here: https://github.com/jacobwgillespie/kysely-planetscale-issue-20.

So, I think this is either an issue inside @planetscale/database, or some misunderstanding of how locks are working with PlanetScale. I would assume the former, since it feels like a single transaction, with ROLLBACK shouldn't affect anything like you said.

Thanks so much for looking into this! Should I raise this issue on the @planetscale/database repo and point to both of our codebases?

I think that makes sense, yeah, they're pretty responsive over there ๐Ÿ‘

This was fixed in @planetscale/database v1.10.0 ๐ŸŽ‰

https://github.com/planetscale/database-js/releases/tag/v1.10.0