11 is broken on NFS
aep opened this issue · 6 comments
Regression description
When upgrading from 10 to 11 with /data on an NFS mount, you get:

FoundryVTT | 2023-07-18 18:01:49 | [error] Database is not open

The switch to LevelDB has been mentioned as the cause. However, upstream LevelDB works fine on NFS, so this is suspicious.

According to Discord, the breakage is intentional, to cause issues for professional setups. But that's Discord, and the more likely explanation is that they're wrapping LevelDB in some exclusive locking to avoid accidentally starting multiple instances against the same data.

I'm opening this issue here because other people in this community have likely hit the same problem. Unfortunately their JavaScript is somewhat obfuscated, which makes the root cause harder to pin down.
Worked up to version
10
Stopped working in version
11
Relevant log output
FoundryVTT | 2023-07-18 18:01:49 | [error] Failed to connect to database "messages": Database is not open
Error: Failed to connect to database "messages": Database is not open
at maybeOpened (/home/foundry/resources/app/node_modules/abstract-level/abstract-level.js:133:18)
at /home/foundry/resources/app/node_modules/abstract-level/abstract-level.js:160:13
at process.processTicksAndRejections (node:internal/process/task_queues:77:11)
FoundryVTT | 2023-07-18 18:01:49 | [error] Database is not open
Error: Database is not open
at LevelDatabase.keys (/home/foundry/resources/app/node_modules/abstract-level/abstract-level.js:697:13)
at LevelDatabase.compactFull (file:///home/foundry/resources/app/dist/database/backend/level-database.mjs:1:1647)
at LevelDatabase.close (file:///home/foundry/resources/app/dist/database/backend/level-database.mjs:1:1171)
at Module.disconnect (file:///home/foundry/resources/app/dist/database/database.mjs:1:1658)
at World.deactivate (file:///home/foundry/resources/app/dist/packages/world.mjs:1:11206)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
Code of Conduct
- I agree to follow this project's Code of Conduct
strace reveals it's not even using flock; it creates a LOCK file, which succeeds:
/home/foundry # strace -f -p 87 2>&1 | grep messages
[pid 87] access("/data/Data/worlds/kmkm/data/messages", F_OK) = 0
[pid 94] mkdir("/data/Data/worlds/kmkm/data/messages", 0777 <unfinished ...>
[pid 96] statx(AT_FDCWD, "/data/Data/worlds/kmkm/data/messages", AT_STATX_SYNC_AS_STAT, STATX_ALL, <unfinished ...>
[pid 95] mkdir("/data/Data/worlds/kmkm/data/messages", 0755 <unfinished ...>
[pid 95] rename("/data/Data/worlds/kmkm/data/messages/LOG", "/data/Data/worlds/kmkm/data/messages/LOG.old" <unfinished ...>
[pid 95] open("/data/Data/worlds/kmkm/data/messages/LOG", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666 <unfinished ...>
[pid 95] mkdir("/data/Data/worlds/kmkm/data/messages", 0755 <unfinished ...>
[pid 95] open("/data/Data/worlds/kmkm/data/messages/LOCK", O_RDWR|O_CREAT|O_LARGEFILE, 0644 <unfinished ...>
[pid 95] access("/data/Data/worlds/kmkm/data/messages/CURRENT", F_OK <unfinished ...>
[pid 95] open("/data/Data/worlds/kmkm/data/messages/CURRENT", O_RDONLY|O_LARGEFILE) = 39
[pid 95] open("/data/Data/worlds/kmkm/data/messages/MANIFEST-000002", O_RDONLY|O_LARGEFILE) = 39
[pid 95] open("/data/Data/worlds/kmkm/data/messages", O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_DIRECTORY) = 39
[pid 95] open("/data/Data/worlds/kmkm/data/messages/000003.log", O_RDONLY|O_LARGEFILE) = 39
[pid 95] readv(39, [{iov_base="|\360\244_ \r\1\1\0\0\0\0\0\0\0\2\0\0\0\1\32!messages!J"..., iov_len=32767}, {iov_base="", iov_len=1024}], 2) = 3367
[pid 95] open("/data/Data/worlds/kmkm/data/messages/000005.ldb", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 39
[pid 95] writev(39, [{iov_base="", iov_len=0}, {iov_base="\242\32\200\0\"\253\f!messages!JMUZweuXaLnRGbw"..., iov_len=1950}], 2) = 1950
[pid 95] open("/data/Data/worlds/kmkm/data/messages/000005.ldb", O_RDONLY|O_LARGEFILE) = 39
[pid 95] stat("/data/Data/worlds/kmkm/data/messages/000005.ldb", {st_mode=S_IFREG|0644, st_size=2107, ...}) = 0
[pid 95] open("/data/Data/worlds/kmkm/data/messages/000005.sst", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
[pid 95] unlink("/data/Data/worlds/kmkm/data/messages/000005.ldb") = 0
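For context on the locking point above: LevelDB's POSIX env takes a POSIX record lock (fcntl F_SETLK) on the LOCK file rather than flock(2), and the two have different semantics. A minimal sketch of the in-process difference, using throwaway temp files rather than a real LevelDB directory:

```python
import fcntl
import os
import tempfile

tmp = tempfile.mkdtemp()

# flock(2): the lock belongs to the open file description, so two
# independent opens of the same file conflict even within one process.
p1 = os.path.join(tmp, "flock-demo")
a = open(p1, "w")
b = open(p1, "w")
fcntl.flock(a, fcntl.LOCK_EX | fcntl.LOCK_NB)
try:
    fcntl.flock(b, fcntl.LOCK_EX | fcntl.LOCK_NB)
    flock_conflicts = False
except OSError:
    flock_conflicts = True

# POSIX record locks (what LevelDB uses via fcntl F_SETLK): the lock
# belongs to the process, so a second attempt through another fd in the
# same process quietly succeeds -- and closing either fd drops the lock.
p2 = os.path.join(tmp, "lockf-demo")
c = open(p2, "w")
d = open(p2, "w")
fcntl.lockf(c, fcntl.LOCK_EX | fcntl.LOCK_NB)
try:
    fcntl.lockf(d, fcntl.LOCK_EX | fcntl.LOCK_NB)
    lockf_conflicts = False
except OSError:
    lockf_conflicts = True

print(flock_conflicts, lockf_conflicts)  # -> prints "True False"
```

Both primitives are advisory either way; the open question for this bug is whether those lock requests and the surrounding directory operations behave the same over NFS as they do locally.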
Compared with a working tmpfs mount, I see no difference:
[pid 87] access("/data/Data/worlds/kmkm/data/messages", F_OK) = 0
[pid 96] mkdir("/data/Data/worlds/kmkm/data/messages", 0777 <unfinished ...>
[pid 94] statx(AT_FDCWD, "/data/Data/worlds/kmkm/data/messages", AT_STATX_SYNC_AS_STAT, STATX_ALL, <unfinished ...>
[pid 97] mkdir("/data/Data/worlds/kmkm/data/messages", 0755 <unfinished ...>
[pid 97] rename("/data/Data/worlds/kmkm/data/messages/LOG", "/data/Data/worlds/kmkm/data/messages/LOG.old" <unfinished ...>
[pid 97] open("/data/Data/worlds/kmkm/data/messages/LOG", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666 <unfinished ...>
[pid 97] mkdir("/data/Data/worlds/kmkm/data/messages", 0755 <unfinished ...>
[pid 97] open("/data/Data/worlds/kmkm/data/messages/LOCK", O_RDWR|O_CREAT|O_LARGEFILE, 0644) = 39
[pid 97] access("/data/Data/worlds/kmkm/data/messages/CURRENT", F_OK) = 0
[pid 97] open("/data/Data/worlds/kmkm/data/messages/CURRENT", O_RDONLY|O_LARGEFILE) = 40
[pid 97] open("/data/Data/worlds/kmkm/data/messages/MANIFEST-000004", O_RDONLY|O_LARGEFILE) = 40
[pid 97] open("/data/Data/worlds/kmkm/data/messages", O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_DIRECTORY) = 40
[pid 97] open("/data/Data/worlds/kmkm/data/messages/000007.log", O_RDONLY|O_LARGEFILE) = 40
[pid 97] open("/data/Data/worlds/kmkm/data/messages/000011.log", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 40
[pid 97] open("/data/Data/worlds/kmkm/data/messages/MANIFEST-000010", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 41
[pid 97] open("/data/Data/worlds/kmkm/data/messages", O_RDONLY|O_LARGEFILE) = 42
[pid 97] open("/data/Data/worlds/kmkm/data/messages/000010.dbtmp", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 42
[pid 97] rename("/data/Data/worlds/kmkm/data/messages/000010.dbtmp", "/data/Data/worlds/kmkm/data/messages/CURRENT") = 0
[pid 97] open("/data/Data/worlds/kmkm/data/messages", O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_DIRECTORY) = 42
[pid 97] unlink("/data/Data/worlds/kmkm/data/messages/000007.log") = 0
[pid 97] unlink("/data/Data/worlds/kmkm/data/messages/MANIFEST-000004") = 0
[pid 87] access("/data/Data/worlds/kmkm/data/messages.db", F_OK <unfinished ...>
[pid 97] open("/data/Data/worlds/kmkm/data/messages/000009.ldb", O_RDONLY|O_LARGEFILE <unfinished ...>
[pid 97] stat("/data/Data/worlds/kmkm/data/messages/000009.ldb", <unfinished ...>
[pid 96] open("/data/Data/worlds/kmkm/data/messages/000012.log", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666 <unfinished ...>
[pid 98] unlink("/data/Data/worlds/kmkm/data/messages/000013.ldb") = -1 ENOENT (No such file or directory)
[pid 98] open("/data/Data/worlds/kmkm/data/messages", O_RDONLY|O_LARGEFILE <unfinished ...>
[pid 98] open("/data/Data/worlds/kmkm/data/messages", O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_DIRECTORY <unfinished ...>
[pid 98] unlink("/data/Data/worlds/kmkm/data/messages/000011.log") = 0
I saw your message on #v11-update-troubleshooting. Interestingly, my production install has always used an NFS share, and it still seems to be working correctly with 11.305.
I'm on Linux using NFS 4.2.
If it helps, my mount point looks like this:
nas:/homes/username on /home/username type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,local_lock=none,addr=y.y.y.y)
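For anyone comparing setups, the options string in parentheses is the interesting part (notably vers and local_lock; local_lock=none means lock requests are sent to the NFS server rather than being satisfied client-side). A small hypothetical helper to turn a mount line like the one above into a dict for comparison:

```python
import re

def nfs_mount_opts(mount_line: str) -> dict:
    """Parse the (...) options field of a `mount` output line into a dict."""
    opts = re.search(r"\(([^)]*)\)", mount_line).group(1).split(",")
    return {k: v for k, v in
            (o.split("=", 1) if "=" in o else (o, True) for o in opts)}

# The mount line reported in this thread:
line = ("nas:/homes/username on /home/username type nfs4 "
        "(rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,"
        "proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=x.x.x.x,"
        "local_lock=none,addr=y.y.y.y)")

opts = nfs_mount_opts(line)
print(opts["vers"], opts["local_lock"])  # -> prints "4.2 none"
```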
Thanks for digging into this in more detail.
I've got a workaround that works for me, but I'm not sure yet whether it's safe.

The underlying issue is that the way they use the db library (wrongly, according to the library's developer) causes a weird combination of syscalls that may lead to data loss on NFS. But I really don't know, and my head hurts.

I'll keep this open in the hope that it reminds me to report back after some production tests.
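One NFS-specific hazard consistent with the traces above: LevelDB freely unlinks files (obsolete logs, the just-written .ldb) that may still be held open. On a local filesystem that's harmless because the data stays reachable through the open fd; an NFS client can only emulate it by sillyrenaming the file to ".nfsXXXX", which can confuse code that re-scans the directory afterwards. A sketch of the local-filesystem guarantee being relied on (a temp file, not a real LevelDB store):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "000005.ldb")

f = open(path, "w+")
f.write("table data")
f.flush()

# Unlink while the file is still open: the name disappears immediately...
os.unlink(path)
name_gone = not os.path.exists(path)

# ...but the data remains readable through the existing fd. POSIX
# guarantees this locally; NFS clients only approximate it by keeping a
# ".nfsXXXX" placeholder around until the last fd closes.
f.seek(0)
still_readable = f.read() == "table data"
f.close()

print(name_gone, still_readable)  # -> prints "True True"
```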
Thank you for your work on this. It's quite a rabbit hole you've gone down. I agree you should keep this issue open in the hope that a solution is found.
Cross-linking your work with the leveldb folks for posterity:
This issue has been automatically marked as stale because it has been inactive for 28 days. To reactivate the issue, simply post a comment with the requested information to help us diagnose this issue. If this issue remains inactive for another 7 days, it will be automatically closed.
This issue has been automatically closed due to inactivity. If you are still experiencing problems, please open a new issue.