[Bug] SQL timeout / lock errors on Kernel Memory operations (adding multiple documents, deleting an index)
Context / Scenario
Hello,
We are having issues with SQL, most likely with the table [dbo].[KMMemories]: it is hard to perform any operations on it. Retrieving a large number of entries takes quite some time. For instance, getting 1000 items takes 25 seconds on the first attempt and 10 seconds on subsequent runs, which is a lot for such a small number of items (I know the JSONs are big). Deleting a single item takes about a second. This would not be the biggest problem if Kernel Memory operations did not cause locks and rollbacks, essentially killing the database for a while.
There are two use cases that cause a problem:
- Ingestion deadlock retry spiral: If we put too many items into the upload queue in a short span of time, the table gets temporarily deadlocked or hits the (very low) timeouts; we have a scheduled job that consumes our repository and pushes everything to KM in a short time. This results in failed uploads and retriggers that end in the same failure, over and over for every retry. We either have to wait for all automatic retries to fail, or forcefully shut the KM app down and keep it down for a few minutes, which we cannot do in production. Only after letting it rest for a while (after the last attempt) can we try to upload again. It also leaves residue: Storage and Search are loaded fully, but the SQL part fails and is never loaded. (A throttling workaround we are considering is sketched below.)
- Unable to delete an index with just 150 items: If we call the delete-index API, it simply times out on SQL, most likely because the delete is slow. This results in a rollback and a retry, and we end up in the same retry -> lock/timeout -> fail loop. We had to go to SQL manually and delete everything for the index ourselves, which took 45 minutes for 1400 items.
The main table has 4800 items, and the index that fails to delete (shown in the screenshot) has just 150 entries.
PS: I suspect you are already working on performance improvements; I just wanted to point out a few annoying use cases.
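For the first case, a workaround we are considering on our side is throttling the scheduled job so only a couple of documents are pushed to KM at a time, instead of flushing the whole repository at once. A minimal sketch of the idea (the endpoint, index name, document list and concurrency limit are placeholders, not our real setup):

```csharp
// Sketch of a throttled ingestion run (placeholders, not our real values):
// only a couple of concurrent ImportDocumentAsync calls at a time, so the
// SQL memory table is not hit by the whole scheduled batch at once.
using Microsoft.KernelMemory;

var memory = new MemoryWebClient("http://km-service");        // placeholder endpoint

var documents = new[]                                          // placeholder document list
{
    (Id: "doc-001", Path: "docs/doc-001.pdf"),
    (Id: "doc-002", Path: "docs/doc-002.pdf"),
};

var throttle = new SemaphoreSlim(2);                           // max 2 concurrent imports

var uploads = documents.Select(async doc =>
{
    await throttle.WaitAsync();
    try
    {
        await memory.ImportDocumentAsync(doc.Path, documentId: doc.Id, index: "botone");
    }
    finally
    {
        throttle.Release();
    }
});

await Task.WhenAll(uploads);
```

This only works around the contention on our side; it does not fix the underlying timeout.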
What happened?
Expectations: we should at least be able to delete an index without hitting a timeout, and to upload without a timeout (the current timeout feels like only a few seconds).
Additionally, it would be great if the table could be optimized for better performance and if indexes were isolated from each other (a DB read/write issue on one index should not block all indices). Finally, it would help to improve KM's retry handling around SQL, although that might be challenging to achieve.
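To illustrate what I mean by better retry handling (this is just the general idea, not KM code): only transient SQL errors such as timeouts (error -2) and deadlock victims (error 1205) would be retried, with a growing delay instead of an immediate re-queue that hits the same contention:

```csharp
// Illustration only (not KM code): retry transient SQL failures with
// exponential backoff so repeated retries don't pile onto a locked table.
using Microsoft.Data.SqlClient;

static async Task WithSqlRetryAsync(Func<Task> operation, int maxAttempts = 5)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            await operation();
            return;
        }
        catch (SqlException ex) when (attempt < maxAttempts
                                      && (ex.Number == -2        // command timeout
                                          || ex.Number == 1205)) // deadlock victim
        {
            // Back off 2s, 4s, 8s, ... before the next attempt.
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
        }
    }
}
```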
Importance
edge case
Platform, Language, Versions
Kubernetes app hosting, C#, Kernel Memory with the SQL connector 0.92.241112.1 (newer versions have no mention of SQL in the changelog)
DB: Azure SQL DB Elastic pool Standard v2 cores
Relevant log output
[14:45:09.889] warn: Microsoft.KernelMemory.Orchestration.AzureQueues.AzureQueuesPipeline[0] Message '5139450a-2350-4e3c-8bfd-29246b9420ad' processing failed with exception, putting message back in the queue with a delay of 4000 msecs Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception (258): Unknown error 258
Microsoft.Data.SqlClient.SqlException:
at Microsoft.Data.SqlClient.SqlConnection.OnError (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
at Microsoft.Data.SqlClient.SqlCommand.InternalEndExecuteNonQuery (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
at Microsoft.Data.SqlClient.SqlCommand.EndExecuteNonQueryInternal (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
at Microsoft.Data.SqlClient.SqlCommand.EndExecuteNonQueryAsync (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.ConfiguredTaskAwaitable`1+ConfiguredTaskAwaiter.GetResult (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at Microsoft.KernelMemory.MemoryDb.SQLServer.SqlServerMemory+<DeleteIndexAsync>d__10.MoveNext (Microsoft.KernelMemory.MemoryDb.SQLServer, Version=0.92.0.0, Culture=neutral, PublicKeyToken=f300afd708cefcd3)
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at Microsoft.KernelMemory.MemoryDb.SQLServer.SqlServerMemory+<DeleteIndexAsync>d__10.MoveNext (Microsoft.KernelMemory.MemoryDb.SQLServer, Version=0.92.0.0, Culture=neutral, PublicKeyToken=f300afd708cefcd3)
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at Microsoft.KernelMemory.Handlers.DeleteIndexHandler+<InvokeAsync>d__7.MoveNext (Microsoft.KernelMemory.Core, Version=0.92.0.0, Culture=neutral, PublicKeyToken=f300afd708cefcd3)
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at Microsoft.KernelMemory.Pipeline.DistributedPipelineOrchestrator+<RunPipelineStepAsync>d__8.MoveNext (Microsoft.KernelMemory.Core, Version=0.92.0.0, Culture=neutral, PublicKeyToken=f300afd708cefcd3)
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at Microsoft.KernelMemory.Pipeline.DistributedPipelineOrchestrator+<>c__DisplayClass5_0+<<AddHandlerAsync>b__0>d.MoveNext (Microsoft.KernelMemory.Core, Version=0.92.0.0, Culture=neutral, PublicKeyToken=f300afd708cefcd3)
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at Microsoft.KernelMemory.Orchestration.AzureQueues.AzureQueuesPipeline+<>c__DisplayClass20_0+<<OnDequeue>b__0>d.MoveNext (Microsoft.KernelMemory.Orchestration.AzureQueues, Version=0.92.0.0, Culture=neutral, PublicKeyToken=f300afd708cefcd3)
Inner exception System.ComponentModel.Win32Exception handled at Microsoft.Data.SqlClient.SqlConnection.OnError:
@marcominerva thoughts?
> We need to manually go to SQL and call delete all for the index, which took 45 minutes for 1400 items.
Based on this statement, this seems to be a problem with SQL itself rather than something directly caused by Kernel Memory. From your log, I see you're using Azure SQL Database. Could you try using a local SQL Server instance?
The original SQL scripts were authored by @kbeaugrand; do you have any additional ideas?
Just FYI: we do not have these problems on the secondary tables holding the JSON contents:
KMEmbeddings_botone
KMMemoriesTags_botone
I would still consider looking into the SQL timeout settings and, if possible, slightly increasing them to more reasonable thresholds, so that a "weaker" database can still handle a bigger concurrent load or a larger amount of data.
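For instance, if the connector builds its SqlCommands from the configured connection string and does not set CommandTimeout explicitly (an assumption on my side), the Command Timeout keyword supported by Microsoft.Data.SqlClient might already be a way to raise the limit from the configuration:

```csharp
// Assumption: the KM SQL Server connector honours the connection string's
// "Command Timeout" default and does not override SqlCommand.CommandTimeout.
using Microsoft.Data.SqlClient;

var builder = new SqlConnectionStringBuilder(
    "Server=tcp:myserver.database.windows.net;Database=km;User ID=km_app;Password=<secret>;")
{
    CommandTimeout = 120,   // seconds per command; example value, default is 30
    ConnectTimeout = 30     // seconds allowed for opening the connection
};

// Use builder.ConnectionString in the Kernel Memory SQL Server configuration.
Console.WriteLine(builder.ConnectionString);
```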
Regarding the index delete, I might check the code to see whether the delete really issues a single call to the DB to remove everything for that index. If that is the case, then with the current timeout, which seemed to be only a few seconds after the delete request was fetched from the queue, I don't think even a fast DB could finish in time, especially for 10k or 100k items.
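For reference, what we ended up running by hand was essentially a batched delete, something along these lines (the "collection" column name is a guess at the schema; the batch size and connection string are placeholders). Each small batch finishes well within the timeout and only holds locks briefly:

```csharp
// Batched manual delete of one index, in chunks, so each statement stays
// well under the command timeout. The column name "collection" is a guess
// at the KMMemories schema; related rows in KMEmbeddings_* / KMMemoriesTags_*
// may need the same treatment unless they cascade.
using Microsoft.Data.SqlClient;

const string connectionString = "Server=...;Database=km;...";   // placeholder
const string indexName = "botone";

await using var connection = new SqlConnection(connectionString);
await connection.OpenAsync();

int deleted;
do
{
    await using var command = new SqlCommand(
        "DELETE TOP (500) FROM [dbo].[KMMemories] WHERE [collection] = @index;",
        connection);
    command.Parameters.AddWithValue("@index", indexName);
    command.CommandTimeout = 60;   // seconds per batch

    deleted = await command.ExecuteNonQueryAsync();
} while (deleted > 0);
```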