LINBIT/linstor-server

Database Issue after upgrade: Unable to Restore Database Entry in table LAYER_DRBD_VOLUMES

AleksZimin opened this issue · 6 comments

Hello! We upgraded LINSTOR from version 1.23.0 to 1.24.2. After the update, the LINSTOR controller crashes with the following error in the logs:

Database entry of table LAYER_DRBD_VOLUMES could not be restored. [Report number 653B7EE3-00000-000000]

Here is the content of the error report:

ERROR REPORT 653B7EE3-00000-000000

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Controller
Version:                            1.24.2
Build ID:                           2026ead52d3c41c21a79ab5e770ac5460210db7f
Build time:                         2023-10-17T07:16:25+00:00
Error time:                         2023-10-27 09:12:08
Node:                               linstor-controller-5747c777f7-qmmvj

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         LinStorDBRuntimeException
Class canonical name:               com.linbit.linstor.LinStorDBRuntimeException
Generated at:                       Method 'loadAll', Source file 'K8sCrdEngine.java', Line #266

Error message:                      Database entry of table LAYER_DRBD_VOLUMES could not be restored.

ErrorContext:   Details:     Primary key: LAYER_RESOURCE_ID = '187', VLM_NR = '0'


Call backtrace:

    Method                                   Native Class:Line number
    loadAll                                  N      com.linbit.linstor.dbdrivers.k8s.crd.K8sCrdEngine:266
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:170
    loadAll                                  N      com.linbit.linstor.core.objects.AbsLayerVlmDataDbDriver:96
    loadAllLayerVlmData                      N      com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:317
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:773
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:640
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:361
    main                                     N      com.linbit.linstor.core.Controller:609

Caused by:
==========

Category:                           RuntimeException
Class name:                         NullPointerException
Class canonical name:               java.lang.NullPointerException
Generated at:                       Method 'getRscData', Source file 'AbsLayerVlmDataDbDriver.java', Line #66


Call backtrace:

    Method                                   Native Class:Line number
    getRscData                               N      com.linbit.linstor.core.objects.AbsLayerVlmDataDbDriver$VlmParentObjects:66
    load                                     N      com.linbit.linstor.core.objects.LayerDrbdVlmDbDriver:153
    load                                     N      com.linbit.linstor.core.objects.LayerDrbdVlmDbDriver:39
    loadAll                                  N      com.linbit.linstor.dbdrivers.k8s.crd.K8sCrdEngine:237
    loadAll                                  N      com.linbit.linstor.dbdrivers.AbsDatabaseDriver:170
    loadAll                                  N      com.linbit.linstor.core.objects.AbsLayerVlmDataDbDriver:96
    loadAllLayerVlmData                      N      com.linbit.linstor.core.objects.AbsLayerRscDataDbDriver:317
    loadLayerObects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:773
    loadCoreObjects                          N      com.linbit.linstor.dbdrivers.DatabaseLoader:640
    loadCoreObjects                          N      com.linbit.linstor.core.DbDataInitializer:169
    initialize                               N      com.linbit.linstor.core.DbDataInitializer:101
    startSystemServices                      N      com.linbit.linstor.core.ApplicationLifecycleManager:87
    start                                    N      com.linbit.linstor.core.Controller:361
    main                                     N      com.linbit.linstor.core.Controller:609


END OF ERROR REPORT.

As a temporary measure, we got LINSTOR operational again by restoring from a backup and reverting to version 1.23.0.
Additionally, it's worth mentioning that our cluster utilizes encryption, which may or may not be relevant to this issue.
I have sent database backups from before and after the upgrade to @ghernadi.
Your assistance in resolving this issue would be highly appreciated. Thank you!

Hello,
Thank you for the db backup, although it was not as useful as I had hoped. The issue is that even the backup made before the upgrade already contains some weird, orphaned entries. That means that I cannot really figure out how those entries got orphaned in the first place.
Before 1.24.0, orphaned entries in the layer* tables were silently ignored (which is why they cause no issues for you in 1.23.0), but with 1.24.0 we had to rework all database drivers to properly implement the export/import feature. Because of the resulting change in how the data is loaded, those orphaned entries are now a problem.
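
To make the difference between "silently ignored" and "now a problem" concrete, here is a minimal Python sketch. It is not LINSTOR's loader code: `LayerVolumeEntry`, `find_orphans` and `load_strict` are hypothetical names, and it assumes an entry counts as orphaned when its layer resource ID has no known parent.

```python
from typing import Iterable, List, NamedTuple


class LayerVolumeEntry(NamedTuple):
    """Hypothetical stand-in for one row of a layer volume table."""
    layer_resource_id: int
    vlm_nr: int


def find_orphans(known_resource_ids: Iterable[int],
                 volume_entries: Iterable[LayerVolumeEntry]) -> List[LayerVolumeEntry]:
    """Return the volume entries whose parent layer resource ID is unknown."""
    known = set(known_resource_ids)
    return [e for e in volume_entries if e.layer_resource_id not in known]


def load_strict(known_resource_ids: Iterable[int],
                volume_entries: Iterable[LayerVolumeEntry]) -> List[LayerVolumeEntry]:
    """Fail on the first orphan instead of skipping it (assumed strict behaviour)."""
    entries = list(volume_entries)
    orphans = find_orphans(known_resource_ids, entries)
    if orphans:
        first = orphans[0]
        raise LookupError(
            f"orphaned entry: LAYER_RESOURCE_ID={first.layer_resource_id}, "
            f"VLM_NR={first.vlm_nr}"
        )
    return entries
```

A lenient loader would simply drop whatever `find_orphans` returns; the strict variant surfaces the same primary key that shows up in the error report.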

However, while cleaning those orphaned entries out of your database backup to get the controller starting again, I noticed that none of the entries in the layerluksvolumes table are properly linked (i.e. all entries of that table are orphaned). That means that none of the resources is encrypted, although 5 resources have orphaned entries in the layerluksvolumes table.

Can you please verify (e.g. by using lsblk) whether the resources you think should be encrypted actually are encrypted on the satellite(s)?
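
If you would rather script that check, the following Python sketch parses `lsblk --json` output and lists devices whose TYPE is `crypt`, which is how lsblk reports an opened dm-crypt mapping. The function names are made up for illustration, and it makes no assumption about how LINSTOR names its devices, so matching the result to a specific resource is still up to you.

```python
import json
import subprocess
from typing import Dict, Iterator, List


def iter_devices(nodes: List[Dict]) -> Iterator[Dict]:
    """Walk the nested 'children' tree that `lsblk --json` emits."""
    for node in nodes:
        yield node
        yield from iter_devices(node.get("children") or [])


def crypt_devices(lsblk_json: str) -> List[str]:
    """Return the names of all devices whose TYPE is 'crypt'."""
    tree = json.loads(lsblk_json)
    return [d["name"] for d in iter_devices(tree.get("blockdevices", []))
            if d.get("type") == "crypt"]


def crypt_devices_on_this_host() -> List[str]:
    """Run lsblk locally; needs a util-linux lsblk with JSON support."""
    out = subprocess.run(
        ["lsblk", "--json", "-o", "NAME,TYPE,FSTYPE"],
        check=True, capture_output=True, text=True,
    ).stdout
    return crypt_devices(out)
```

An empty list from `crypt_devices_on_this_host()` on a satellite suggests that no dm-crypt mapping is currently open on that host.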

Hello,

Thank you for your response. Currently, there are indeed no encrypted resources. I checked with my colleagues, and they informed me that while we had encrypted resources in the past, we have since decided to stop using them.

Alright, in that case, here is the list of orphaned entries I had to delete in order to make the controller start again:

layerdrbdresources.internal.linstor.linbit.com
38b2d03f3256502b1e9db02b2d12aa27a46033ffe6d8c0ef0f2cf6b1530be9d8    // layer_resource_id 187, possibly related resource: PVC-4E1D5CB7-8F4E-4099-BB18-57A96720AC38 on node A-HV-1

layerdrbdvolumes.internal.linstor.linbit.com
d12d1b46eb96c378080b01bd40268b657dc875dd535bd366c99baadafaf9e501    // layer_resource_id 187, possibly related resource: PVC-4E1D5CB7-8F4E-4099-BB18-57A96720AC38 on node A-HV-1

layerluksvolumes.internal.linstor.linbit.com
55c2d358233e136ca874a89077735bb28cfea3fe174de785a525123a1ed30242    // layer_resource_id 128, possibly related resource: PVC-06363BD2-4BD6-4AC9-864A-813DC14889AF on node B-HV-3
59db9ac99b3b60062c41d24e2a93cbde31df44dd3a25c17b4690c8ff606aa19e    // layer_resource_id 171, possibly related resource: PVC-291F30CA-3EC3-4A74-963C-BC93891934A7 on node A-HV-1
98ffd3e4a7dbed2a98a19026543f0ed7255f0afea23fdcb66b8555f3e902b796    // layer_resource_id 180, possibly related resource: PVC-7E11A742-82B4-4508-9BDB-4A506D767D05 on node B-HV-8
b10a42a43820fa6125bb5a40a214dada1b77e261c74ea6ff05447a01d6491ff3    // layer_resource_id 188, possibly related resource: PVC-1016C04C-64FD-490F-B898-B4C723E56627 on node C-HV-5
af1e91af8629c4c872ff18f9e94a402e3c75a5bc687179022b10bc06f35d4ed1    // layer_resource_id 409, possibly related resource: PVC-8CF70609-68AB-4B8B-AEF2-8A1D33E1CAB4 on node A-HV-9

layerstoragevolumes.internal.linstor.linbit.com
b6d80e634e3edf89f2261be752de899ced710dc15e1537e806da515b27b8c089    // layer_resource_id 172, possibly related resource: PVC-1456CDC1-E899-40DD-AAE8-163A8BA399C5 on node A-HV-9

Please make another backup before you attempt to delete these entries, just to be sure.

I have also noted the (possibly) related resources of those deleted entries, so please make sure those resources are fine after the deletion.
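
As a rough sketch of how the dump-then-delete steps could be scripted, the Python below only builds and prints `kubectl` commands; it does not run them. It assumes that the hex strings in the list above are the object names of the custom resources and that your current kubectl context can reach them. `ORPHANS`, `backup_command`, `delete_command` and `plan` are hypothetical names, and only the first table is filled in.

```python
import shlex
from typing import Dict, List

# Copied from the list above; extend with the remaining tables as needed.
ORPHANS: Dict[str, List[str]] = {
    "layerdrbdresources.internal.linstor.linbit.com": [
        "38b2d03f3256502b1e9db02b2d12aa27a46033ffe6d8c0ef0f2cf6b1530be9d8",
    ],
}


def backup_command(crd: str) -> List[str]:
    """Dump every object of one table as YAML (redirect stdout to a file)."""
    return ["kubectl", "get", crd, "-o", "yaml"]


def delete_command(crd: str, name: str) -> List[str]:
    """Delete a single object, assuming `name` is its object name."""
    return ["kubectl", "delete", crd, name]


def plan(orphans: Dict[str, List[str]]) -> List[List[str]]:
    """All dump commands first, then all delete commands."""
    cmds = [backup_command(crd) for crd in orphans]
    cmds += [delete_command(crd, name)
             for crd, names in orphans.items() for name in names]
    return cmds


if __name__ == "__main__":
    for cmd in plan(ORPHANS):
        print(shlex.join(cmd))
```

Printing instead of executing leaves room to review each command first. A per-table YAML dump is not necessarily a complete backup, so it does not replace the backup procedure you used before the upgrade.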

Hello! Thank you very much for the solution provided. It helped us, and all resources are fine after deleting the entries.