DemoStorage does not take whiteouts into account -> leading to data corruption
Hello up there. I've discovered that DemoStorage wrongly handles load for objects deleted in the read-write overlay, leading to data corruption. It can be fixed if we correct the interface of IStorage.loadBefore. Please find details in the example program provided below.
Kirill
/cc @jimfulton, @hannosch
---- 8< ---- (bug-demostorage-delete)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Program bug-demostorage-delete demonstrates that DemoStorage wrongly handles
load for objects deleted in read-write overlay:
If object is deleted, even though a whiteout is recorded in the overlay,
loadBefore can incorrectly return object data of previous object revision from
read-only base, leading to data corruption.
Example output::
base:
@03d9186bbe8b5188 obj<0000000000000001> -> int(0)
@03d9186bbe8ce3ee obj<0000000000000001> -> int(1)
base+overlay:
@03d9186bbe8feddd obj<0000000000000001> -> int(2)
@03d9186bbe9147bb root['obj'] = ø
GC:
@03d9186bbe931b00 del obj<0000000000000001>
remount base+overlay:
@03d9186bbe8feddd obj<0000000000000001> -> int(2) # must be int(2)
@03d9186bbe9147bb obj<0000000000000001> -> int(2) # must be int(2) (garbage, but still in DB)
@03d9186bbe931b00 obj<0000000000000001> -> int(1) # <- WRONG: must be POSKeyError, not int(1) from base
The bug could be fixed if we change IStorage.loadBefore(oid) to return
(None, serial) # serial = revision where object was deleted
instead of just
None
for deleted records.
Please see https://lab.nexedi.com/kirr/ZODB/commit/a762e2f8 for details.
"""
from __future__ import print_function
from ZODB.FileStorage import FileStorage
from ZODB.DemoStorage import DemoStorage
from ZODB.Connection import TransactionMetaData
from ZODB import DB
from persistent import Persistent
import transaction
import os, os.path
class PInt(Persistent):
def __init__(self, i):
self.i = i
def __str__(self):
return "int(%d)" % self.i
def h(tid):
return tid.encode('hex')
def dump(obj):
print("@%s obj<%s> -> %s" % (h(obj._p_serial), h(obj._p_oid), obj))
def rmf(path):
if os.path.exists(path):
os.unlink(path)
def fs1clear(datafs):
for suffix in ('', '.index', '.lock', '.tmp'):
rmf(datafs + suffix)
# mkbase prepares base storage.
def mkbase():
print("base:")
fs1clear("base.fs")
zbase = FileStorage("base.fs")
db = DB(zbase)
conn = db.open()
root = conn.root()
root['obj'] = obj = PInt(0)
transaction.commit()
dump(obj)
obj.i += 1
transaction.commit()
dump(obj)
conn.close()
db.close()
zbase.close()
zbase = FileStorage("base.fs", read_only=True)
return zbase
def main():
# prepare base + overlay
zbase = mkbase()
fs1clear("overlay.fs")
zoverlay = FileStorage("overlay.fs")
zdemo = DemoStorage(base=zbase, changes=zoverlay)
print("\nbase+overlay:")
db = DB(zdemo)
conn = db.open()
root = conn.root()
obj = root['obj']
oid = obj._p_oid
obj.i += 1
# modify root as well so that there is root revision saved in overlay that points to obj
root['x'] = 1
transaction.commit()
dump(obj)
atLive = obj._p_serial
# delete obj from root making it a garbage
del root['obj']
transaction.commit()
atUnlink = root._p_serial
print("@%s root['obj'] = ø" % h(atUnlink))
# unmount DemoStorage
conn.close()
db.close()
zdemo.close() # closes zbase and zoverlay as well
del zbase, zoverlay
# simulate GC on base+overlay
print("\nGC:")
zoverlay = FileStorage("overlay.fs")
txn = transaction.get()
txn_meta = TransactionMetaData(txn.user, txn.description, txn.extension)
zoverlay.tpc_begin(txn_meta)
zoverlay.deleteObject(oid, atLive, txn_meta)
zoverlay.tpc_vote(txn_meta)
atGC = zoverlay.tpc_finish(txn_meta)
print("@%s del obj<%s>" % (h(atGC), h(oid)))
# remount base+overlay
print("\nremount base+overlay:")
zbase = FileStorage("base.fs", read_only=True)
zdemo = DemoStorage(base=zbase, changes=zoverlay)
db = DB(zdemo)
def dumpObjAt(at, comment):
conn = db.open(at=at)
obj = conn.get(oid)
print("@%s obj<%s> -> %s\t# %s" % (h(at), h(oid), obj, comment))
conn.close()
dumpObjAt(atLive, "must be int(2)")
dumpObjAt(atUnlink, "must be int(2) (garbage, but still in DB)")
dumpObjAt(atGC, "<- WRONG: must be POSKeyError, not int(1) from base")
if __name__ == '__main__':
main()
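To illustrate how the proposed (None, serial) return would be used, here is a minimal, hypothetical sketch (not the actual patch; see the commit linked above) of the base+overlay decision DemoStorage has to make:

from ZODB.POSException import POSKeyError

def demo_loadBefore(base, changes, oid, before):
    # Sketch only; real DemoStorage code differs.  The point is that a
    # whiteout in the overlay must stop the fallback to the base.
    r = changes.loadBefore(oid, before)
    if r is not None:
        data = r[0]
        if data is None:
            # proposed (None, serial): the object was deleted in the overlay
            # at `serial`; do NOT fall back to the base
            raise POSKeyError(oid)
        return r
    # no overlay revision at all before `before` -> consult the read-only base
    return base.loadBefore(oid, before)

With today's interface the overlay answers plain None both for "deleted here" and for "never had a revision here", which is exactly why DemoStorage falls through to the base and resurrects stale data.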
The bug could be fixed if we change IStorage.loadBefore(oid) to return
(None, serial) # serial = revision where object was deleted
instead of just
None
for deleted records.
Perhaps I misunderstand, but wouldn't that require (infinitely) storing the history of deletions, or at the very least storing an abbreviated version of it? Wouldn't that thus largely defeat a lot of the point of packing a storage? If not that, then it requires new auxiliary data structures to track it that are "packed" separately; e.g., for FileStorage, a whole new file. That doesn't seem particularly feasible.
Likewise, it would also require substantial redesign for RelStorage to be able to provide that information in either history-preserving or history-free mode.
@jamadden, thanks for the feedback. Maybe I too misunderstand something, but I think there is no need for an additional side history: if the deletion record is not yet packed, we have its serial. If the deletion record is covered by a pack and removed, it is ok to return 0 as the serial from loadBefore, i.e. as if the object was "not yet created and later removed".
In other words, for deletion records the situation is the same as for plain data records: if the record is still covered by the database, we have it in the history and take it into account. If it is not covered, the database operates as if there is no previous record for the object in question.
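As a toy illustration of those semantics (not ZODB code; plain Python over an in-memory history):

def load_before(history, before):
    """history: list of (tid, data) sorted by tid; data=None marks a deletion.

    Return (data, serial) of the last revision with tid < before, or None if
    there is no such revision at all.
    """
    result = None
    for tid, data in history:
        if tid < before:
            result = (data, tid)   # (None, tid) when the object was deleted at tid
    return result

history = [(1, b'DATA1'), (2, b'DATA2'), (3, None)]   # created, modified, deleted
assert load_before(history, 3) == (b'DATA2', 2)       # object still live
assert load_before(history, 4) == (None, 3)           # deleted at tid 3
# After a pack removes the whole (now garbage) history, nothing is left; the
# storage can then answer (None, 0): "not yet created and later removed".
assert load_before([], 4) is None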
For FileStorage this should work well without any auxiliary file because FileStorage records object deletion via its regular DataHeader with both data_len and back_pos set to 0:
https://github.com/zopefoundation/ZODB/blob/a89485c1/src/ZODB/FileStorage/format.py#L57-L75
https://github.com/zopefoundation/ZODB/blob/a89485c1/src/ZODB/FileStorage/FileStorage.py#L639-L641
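For example, whether a FileStorage data record is such a deletion/whiteout can be recognized directly from its header; a minimal sketch, assuming the on-disk layout described in format.py (the function name and the file offset pos are hypothetical):

import struct

DATA_HDR = ">8s8s8s8sH8s"   # oid, tid, prev_rec_pos, txn_pos, version_len, data_len
DATA_HDR_LEN = struct.calcsize(DATA_HDR)  # 42 bytes

def is_whiteout(f, pos):
    """Sketch: True if the data record at file offset pos marks a deletion.

    deleteObject writes a record with data_len == 0 and a zero backpointer
    (the same shape is also used for undo of object creation).
    """
    f.seek(pos)
    oid, tid, prev, tloc, vlen, plen = struct.unpack(DATA_HDR, f.read(DATA_HDR_LEN))
    if plen != b"\0" * 8:
        return False              # record carries pickle data -> not a deletion
    back = f.read(8)              # data_len == 0 -> an 8-byte backpointer follows
    return back == b"\0" * 8      # back_pos == 0 -> whiteout / creation undo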
About RelStorage: I quickly looked, and, if I understood correctly, it is:
- the first phase of loadBefore is implemented by something like

  SELECT state, tid FROM object_state
  WHERE zoid = <oid>
    AND tid < <tid>
  ORDER BY tid DESC
  LIMIT 1

  https://github.com/zodb/relstorage/blob/3.1.1-1-ge7628f9/src/relstorage/storage/load.py#L245-L246
  https://github.com/zodb/relstorage/blob/3.1.1-1-ge7628f9/src/relstorage/adapters/mover.py#L148-L175
- for history-preserving mode, deleteObject is currently implemented by something like

  UPDATE object_state
  SET state = NULL,
      state_size = 0,
      md5 = ''
  WHERE zoid = <oid>
    AND tid = <oldserial>

  https://github.com/zodb/relstorage/blob/3.1.1-1-ge7628f9/src/relstorage/adapters/packundo.py#L377-L384
  https://github.com/zodb/relstorage/blob/3.1.1-1-ge7628f9/src/relstorage/adapters/packundo.py#L258-L263

  which could instead be changed to something like

  INSERT INTO object_state (zoid, prev_tid, state, state_size, ...)
  VALUES (<oid>, <old_serial>, NULL, 0, ...)

  i.e. adding a next data record that indicates object deletion, instead of removing the object state of the record with oldserial. This maintains all the information that loadBefore needs to correctly return the deletedAt serial for a loadBefore query with at > old_serial (i.e. before > old_serial+1).
  By the way, is it correct to remove the data of the record with oldserial like it is currently done? At the database view of that oldserial, the object data has to be preserved, and loadBefore with before=oldserial+1 should give live data, not "object deleted", even after the object was deleted:
  https://github.com/zopefoundation/ZODB/blob/5.6.0-3-g22df1fd11/src/ZODB/interfaces.py#L1276-L1280
  ("This method marks an object as deleted via a new object revision...")
  Maybe I'm misunderstanding something here though...
- for history-free mode, deleteObject is implemented via

  DELETE FROM object_state
  WHERE zoid = <oid>
    AND tid = <oldserial>

  i.e. object data is removed in place. And it is correct to do so: history-free mode can be seen as a mode where a pack is run after every commit, preserving only data reachable from the latest database view. If we imagine such a pack running after a commit with deleteObject, it would have exactly the same effect: remove the data for the deleted oid from the object_state table.
So, once again, maybe I'm missing something, but here I don't see a problem for either RelStorage or traditional storages like FileStorage.
For the reference, here is a symmetric proposal for the data-returning side of loadBefore: instead of returning

  (data, serial, next_serial)

return only

  (data, serial)

i.e. don't return next_serial at all.

This would unify the return signature to always be (data, serial), both when returning data of an existing object and when indicating that the object was deleted or does not exist.

For FileStorage this should not make a difference, but for NEO and RelStorage it should offload loadBefore from performing an additional SQL query, which should improve loadBefore latency and speed things up. Here is the relevant place in RelStorage with that additional SQL query to retrieve next_serial, which could be removed:
https://github.com/zodb/relstorage/blob/3.1.1-1-ge7628f9/src/relstorage/storage/load.py#L259-L264
https://github.com/zodb/relstorage/blob/3.1.1-1-ge7628f9/src/relstorage/adapters/mover.py#L177-L199
The reason I say that next_serial can be removed is that in ZODB/py the only place I could find where next_serial is used on the client is the client cache, and that cache can be remade to work without next_serial at all. In simple words: after a loadBefore(oid, at) -> (data, serial) query, the cache can remember data for oid over the [serial, at] range.
When an invalidation message is received from the server, cache entries that had at == client_head are extended (at -> new_head) for oids that are not present in the invalidation message, while for oids that are present no such extension is done. This keeps the cache in a correct state, invalidates it when there is a need to invalidate, and does not throw away cache entries that should remain live. This of course requires the ZODB server to include both modified and just-created objects in invalidation messages (zopefoundation/ZEO#160).
For the record: some of my older notes regarding this topic from http://navytux.spb.ru/~kirr/neo.html#development-overview and other places go below.
Kirill
/cc @vpelletier, @jmuchemb, @arnaud-fontaine
---- 8< ----
A note right away: by insisting on specifying the ZODB/go interfaces right, a performance bug in current ZODB/py was discovered: ZODB/py loadBefore, in addition to serial, also returns serial_next, which constrains storage implementations unnecessarily and is used only in the client cache. The ZODB/go client cache implementation shows that it is possible to build an efficient client cache without serial_next being returned from Load, and for this reason the ZODB/go Load interface specification comes without a serial_next return. This in turn paves the way for how even NEO/py could be reworked to use 2x fewer SQL queries for a load-object request.
Currently for a GetObject request NEO performs 2 SQL queries - one to load object data and another one to get next_serial for it:
(mariadb backend)
https://lab.nexedi.com/nexedi/neoppod/blob/fb746e6b/neo/storage/database/mysqldb.py#L484-508
https://lab.nexedi.com/nexedi/neoppod/blob/fb746e6b/neo/storage/database/mysqldb.py#L477-482
(sqlite backend)
https://lab.nexedi.com/nexedi/neoppod/blob/fb746e6b/neo/storage/database/sqlite.py#L328-349
https://lab.nexedi.com/nexedi/neoppod/blob/fb746e6b/neo/storage/database/sqlite.py#L322-326
I originally wanted to remove next_serial from the load return for interface clarity, but it turns out it also reduces load on the ZODB storage server and, in particular, can reduce the number of SQL queries by 2x.
A patch for loadAt that implements the semantics proposed in #318 (comment) is here: #323.
For the reference, see also https://lab.nexedi.com/kirr/neo/commit/4fb6bd0a?expanded=1 which provides a DemoStorage analogue for ZODB/go with a fix for this particular issue.
This issue about data corruption has been open for almost a year.
The patch that fixes this data corruption has also been there for almost a year: #323.
About 2 months ago that patch was decoupled from a preparatory step that turned out to be questionable (#323 (comment)).
Could we please move on with getting this issue fixed?
I will take a look in the next few days.
Thanks, @d-maurer.
When you need a review, you can use a GitHub workflow for the request: you find the corresponding actions in the right column of the GitHub pages, named "Reviewers".
Thanks for the advice. Just for the reference: for loadAt I already asked this way for feedback from @jamadden and @jimfulton some time ago: #323 (comment).
Jim is currently not interested in any ZEO/ZODB stuff, as his day job has nothing to do with these.
I hope I got this right. I met him at the April 2021 Zope sprint.
I was just going to say the same thing. In a conference call a couple weeks ago Jim made it clear that he isn't using ZODB for anything right now, and he won't spend time on it anymore. Don't wait for his feedback, it will not come.
I've been very busy on other things but hope to find time to examine these PRs in depth "soon."
Thanks a lot for feedback, everyone.
It is sad to hear about Jim leaving ZODB land, but at the same time it helps to have it clearly stated.
Jim's absence here is, of course, felt a lot.
@jamadden, thanks for the update. I understand we are all busy and, due to that, in practice "soon" might turn out to be somewhat distant. Still, it is very helpful to read your feedback, because it helps to keep up the motivation, knowing that "this topic is not orphaned; people are just busy, but they still plan to take a look". So, thanks a lot, again, for this.
Kirill
P.S. Did @tseaver maybe provide a similar update about his status somewhere? At Pylons/zodburi#29 I'm struggling while waiting for review feedback, and it is not coming despite many ping attempts. And, maybe it's just a coincidence, but these days I see Jim and Tres being active at the same places on GitHub...
P.S. Did @tseaver maybe provide a similar update about his status somewhere? At Pylons/zodburi#29 I'm struggling while waiting for review feedback, and it is not coming despite many ping attempts.
I am sure the situation is similar: Tres (and Jim) are listed as maintainers on many packages that they haven't been using in years. They have moved on to other work and other interests.
@dataflake, thanks for feedback on this.
The bug could be fixed if we change IStorage.loadBefore(oid) to return (None, serial) # serial = revision where object was deleted instead of just None for deleted records.
When loading after the deleted tid, NEO raises POSException.POSKeyError. Any comment about this?
Here is the interface specification for loadBefore:
Lines 710 to 724 in 1fb097b
My reading of that text is that the storage should return None when accessing an object past its deletion:
If there isn't data before the object before the given
transaction, then None is returned, otherwise three values are
returned:
...
If the object id isn't in the storage, then POSKeyError is raised.
And this interpretation coincides with actual FileStorage behaviour.
That means that here NEO is not following the IStorage specification.
Anyway, what really matters for the fix is that, in addition to the "deleted" indicator (be it None or POSKeyError), the serial of the deletion is also provided.
I understand well the need to distinguish between the 2 cases. But I wondered at least whether I'd have something to fix in NEO, which surprised me because that's something on which I spent time. I've just checked the code of FileStorage and it behaves like NEO: since #64, a deletion results in POSKeyError, contrary to what the initial comment states. What's even less clear is whether returning None for a deleted record (along with changes in DemoStorage) would actually fix this issue; of course, I admit it would not be ideal.
Ok, it was my mistake in #318 (comment); I misunderstood how FileStorage behaves. Here is proof that FileStorage indeed raises POSKeyError after object deletion, but returns None before the object was created:
---- 8< ---- (z.dump)
txn 0000000000000001 " "
user ""
description ""
extension ""
obj 0000000000000000 4 null:00
ROOT
txn 0000000000000002 " "
user ""
description ""
extension ""
obj 0000000000000001 5 null:00
DATA1
txn 0000000000000003 " "
user ""
description ""
extension ""
obj 0000000000000001 5 null:00
DATA2
txn 0000000000000004 " "
user ""
description ""
extension ""
obj 0000000000000001 delete
txn 0000000000000005 " "
user ""
description ""
extension ""
obj 0000000000000001 8 null:00
UNDELETE
txn 0000000000000006 " "
user ""
description ""
extension ""
obj 0000000000000001 delete
(neo) (z-dev) (g.env) kirr@deco:~/src/wendelin/z/zodbtools/t$ zodb restore x.fs <z.dump
0000000000000001
0000000000000002
0000000000000003
0000000000000004
0000000000000005
0000000000000006
(neo) (z-dev) (g.env) kirr@deco:~/src/wendelin/z/zodbtools/t$ fs1 dump x.fs
Trans #00000 tid=0000000000000001 time=1900-01-01 00:00:00.000000 offset=27
status=' ' user='' description=''
data #00000 oid=0000000000000000 size=4 class=?.?
Trans #00001 tid=0000000000000002 time=1900-01-01 00:00:00.000000 offset=104
status=' ' user='' description=''
data #00000 oid=0000000000000001 size=5 class=?.?
Trans #00002 tid=0000000000000003 time=1900-01-01 00:00:00.000000 offset=182
status=' ' user='' description=''
data #00000 oid=0000000000000001 size=5 class=?.?
Trans #00003 tid=0000000000000004 time=1900-01-01 00:00:00.000000 offset=260
status=' ' user='' description=''
data #00000 oid=0000000000000001 class=undo or abort of object creation
Trans #00004 tid=0000000000000005 time=1900-01-01 00:00:00.000000 offset=341
status=' ' user='' description=''
data #00000 oid=0000000000000001 size=8 class=?.?
Trans #00005 tid=0000000000000006 time=1900-01-01 00:00:00.000000 offset=422
status=' ' user='' description=''
data #00000 oid=0000000000000001 class=undo or abort of object creation
---- 8< ---- x.py
#!/usr/bin/env python
from __future__ import print_function
from ZODB.FileStorage import FileStorage
from ZODB.utils import p64, u64
import traceback as tb
import sys
oid = p64(1)
zstor = FileStorage('x.fs', read_only=True)
for tid in (1,2,3,4,5,6):
print('load %s@%s -> ' % (u64(oid), tid), end='')
sys.stdout.flush()
try:
x = zstor.loadBefore(oid, p64(tid+1))
except:
tb.print_exc()
else:
print('%r' % (x,))
(neo) (z-dev) (g.env) kirr@deco:~/src/wendelin/z/zodbtools/t$ ./x.py
load 1@1 -> None
load 1@2 -> ('DATA1', '\x00\x00\x00\x00\x00\x00\x00\x02', '\x00\x00\x00\x00\x00\x00\x00\x03')
load 1@3 -> ('DATA2', '\x00\x00\x00\x00\x00\x00\x00\x03', '\x00\x00\x00\x00\x00\x00\x00\x04')
load 1@4 -> Traceback (most recent call last):
File "./x.py", line 16, in <module>
x = zstor.loadBefore(oid, p64(tid+1))
File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/FileStorage/FileStorage.py", line 582, in loadBefore
raise POSKeyError(oid)
POSKeyError: 0x01
load 1@5 -> ('UNDELETE', '\x00\x00\x00\x00\x00\x00\x00\x05', '\x00\x00\x00\x00\x00\x00\x00\x06')
load 1@6 -> Traceback (most recent call last):
File "./x.py", line 16, in <module>
x = zstor.loadBefore(oid, p64(tid+1))
File "/home/kirr/src/wendelin/z/ZODB/src/ZODB/FileStorage/FileStorage.py", line 582, in loadBefore
raise POSKeyError(oid)
POSKeyError: 0x01
DemoStorage comments seem to agree with my original understanding and with the text of the loadBefore interface:
Lines 220 to 229 in 1fb097b
but FileStorage behaviour was indeed changed in #64.