How to stay compatible with data schema changes?

Question

How to stay compatible with data schema changes?

jun0315 opened this issue a year ago · 7 comments

I made a mistake earlier: I did not pay attention to this note in the README:

As a result, Bincode is suitable for storing data. Be aware that it does not implement any sort of data versioning scheme or file headers, as these features are outside the scope of this crate.

I used bincode to encode our database's metadata, such as segment and snapshot (see databendlabs/databend#11015). In the future, our database will need to add or remove metadata fields. For example, going from v3.0 to v3.1 adds some new fields, and the v3.1 reading code cannot directly read data written by v3.0, so the data becomes incompatible. Do you have any good ideas? I apologize again for the issue caused by my mistake.

Answer 1 · 2023-05-25T17:21:12.000Z

Keep the old type definitions of the metadata around and add a small header or footer to your segment files whose validity can be checked, e.g. one that has a fixed size and contains version information plus a checksum of that version information.

When loading a segment file, read the fixed-size header and verify the checksum. If it matches, use the version information to pick the correct metadata type definitions for deserializing the payload. If it does not match (or the file is too small), then the segment was written before the first schema with version information (or the file is actually corrupt), and you again know which type definitions to use for deserialization.

Past that, you just need some code which takes instances of the old type definitions and turns them into instances of the new types.
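To make the header idea concrete, here is a minimal std-only sketch. The magic bytes, the 12-byte layout, the FNV-1a checksum and all names (`MAGIC`, `detect`, `MetaV0`, `MetaV1`) are assumptions made up for illustration; nothing here comes from bincode or from the Databend code.

```rust
// Hypothetical fixed-size header: 4 magic bytes, a u32 version and a u32
// checksum over the first 8 bytes, all little-endian.
const MAGIC: [u8; 4] = *b"SEGM";
const HEADER_LEN: usize = 12;

/// Toy checksum (32-bit FNV-1a); a real format would more likely use a CRC.
fn checksum(bytes: &[u8]) -> u32 {
    bytes
        .iter()
        .fold(0x811c_9dc5u32, |h, b| (h ^ u32::from(*b)).wrapping_mul(0x0100_0193))
}

fn read_u32(b: &[u8]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], b[3]])
}

fn write_header(version: u32) -> [u8; HEADER_LEN] {
    let mut out = [0u8; HEADER_LEN];
    out[..4].copy_from_slice(&MAGIC);
    out[4..8].copy_from_slice(&version.to_le_bytes());
    let sum = checksum(&out[..8]);
    out[8..].copy_from_slice(&sum.to_le_bytes());
    out
}

/// Which set of type definitions should decode the payload.
#[derive(Debug, PartialEq)]
enum Schema {
    /// No valid header: written before headers existed (or corrupt).
    Legacy,
    /// Valid header carrying this version number.
    Versioned(u32),
}

/// Returns the detected schema together with the payload bytes.
fn detect(file: &[u8]) -> (Schema, &[u8]) {
    if file.len() >= HEADER_LEN
        && file.starts_with(&MAGIC)
        && read_u32(&file[8..12]) == checksum(&file[..8])
    {
        return (Schema::Versioned(read_u32(&file[4..8])), &file[HEADER_LEN..]);
    }
    (Schema::Legacy, file)
}

// Old definitions stay in the codebase and are upgraded after decoding.
struct MetaV0 {
    name: String,
}

struct MetaV1 {
    name: String,
    compressed: bool, // hypothetical field added in the newer schema
}

impl From<MetaV0> for MetaV1 {
    fn from(old: MetaV0) -> Self {
        // Fields that did not exist before get an explicit default.
        MetaV1 { name: old.name, compressed: false }
    }
}
```

The reader would call `detect` first, decode the payload with the type definitions that match the returned `Schema`, and then convert the result to the newest type with `into()`.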

Answer 2 · 2023-05-26T02:16:37.000Z

Thanks a lot! @adamreichold
Your idea is quite similar to our current versioning scheme. Our problem is that once a structure has been defined in v3, it can no longer be modified; it is effectively frozen.
Our current idea is to rewrite all the structures included in v3 for future v4 versions, but this will require a lot of work and make the codebase less elegant.

Answer 3 · 2023-05-26T05:17:13.000Z

Instead of rewriting your structures again for v4, why not switch to a serialization format that supports schema evolution for v4? There is no rule that your segment files must always contain bincode-serialized data; this could change based on the header as well.

Answer 4 · 2023-05-27T09:46:31.000Z

There are two ways that come to mind to make a bincode schema forwards-compatible (and I should really document this properly somewhere).

Always end your structs with an Option<()> and set it to None. This way bincode will write a single 0 byte. In the future you can replace this Option<()> with Option<Continuation>, e.g.:

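struct HeaderV1 {
    length: usize,
    name: String,
    // Starts out as Option<()>, always written as None (a single 0 byte).
    continuation: Option<HeaderV2>,
}

// Fields added in the next version live in the continuation.
struct HeaderV2 {
    readable: bool,
    writable: bool,
}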
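A std-only sketch of why the trailing Option works. It imitates the byte layout by hand rather than calling bincode, and it assumes fixed-width little-endian integers, a u64 length prefix for strings, and a one-byte tag for Option (my understanding of the bincode 1.x defaults, so please check docs/spec.md for your configuration). `Reader`, `encode_old` and `decode` are hypothetical helpers, and `length` is a u64 here to keep the sketch platform-independent.

```rust
#[derive(Debug, PartialEq)]
struct HeaderV2 {
    readable: bool,
    writable: bool,
}

#[derive(Debug, PartialEq)]
struct HeaderV1 {
    length: u64,
    name: String,
    continuation: Option<HeaderV2>,
}

/// Minimal cursor over a byte slice.
struct Reader<'a> {
    buf: &'a [u8],
}

impl<'a> Reader<'a> {
    fn take(&mut self, n: usize) -> Option<&'a [u8]> {
        let buf: &'a [u8] = self.buf;
        if buf.len() < n {
            return None;
        }
        let (head, tail) = buf.split_at(n);
        self.buf = tail;
        Some(head)
    }

    fn u8(&mut self) -> Option<u8> {
        self.take(1).map(|b| b[0])
    }

    fn u64(&mut self) -> Option<u64> {
        let mut raw = [0u8; 8];
        raw.copy_from_slice(self.take(8)?);
        Some(u64::from_le_bytes(raw))
    }
}

/// What the old writer produces: the struct ends with Option<()> = None.
fn encode_old(length: u64, name: &str) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&length.to_le_bytes());
    out.extend_from_slice(&(name.len() as u64).to_le_bytes());
    out.extend_from_slice(name.as_bytes());
    out.push(0); // tag byte for None
    out
}

/// The new reader: the same tag byte now guards an optional HeaderV2.
fn decode(bytes: &[u8]) -> Option<HeaderV1> {
    let mut r = Reader { buf: bytes };
    let length = r.u64()?;
    let name_len = r.u64()? as usize;
    let name = String::from_utf8(r.take(name_len)?.to_vec()).ok()?;
    let continuation = match r.u8()? {
        0 => None,
        1 => Some(HeaderV2 {
            readable: r.u8()? != 0,
            writable: r.u8()? != 0,
        }),
        _ => return None,
    };
    Some(HeaderV1 { length, name, continuation })
}
```

Bytes from the old writer end in a 0 tag, so the new reader sees `continuation: None`; only new writers emit a 1 followed by the extra fields.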

Use enums.

Serde (and thus bincode) will serialize enum variants based on their index, e.g.:

enum Foo {
    Bar, // index 0, will be serialized as a 0
    Baz, // 1
    // etc
}

This is the reason you can't reorder variants in enums without making breaking changes.

However, it also means that you can append new variants, and the new code will still be able to read data written by the old version:

enum Foo {
    Bar, // index 0, will be serialized as a 0
    Baz, // 1
    NewVariant, // 2
}

The previous version will never write a 2, therefore it should be safe to make this change.
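The same point as a std-only sketch. It hand-encodes the variant index as a 4-byte little-endian integer, which is how I understand the bincode 1.x defaults; the width differs with other configurations. `encode_foo` and `decode_foo` are hypothetical helpers, not bincode functions.

```rust
#[derive(Debug, PartialEq)]
enum Foo {
    Bar,        // index 0
    Baz,        // index 1
    NewVariant, // index 2, appended in the newer version
}

/// Writes only the variant index, mirroring a field-less enum.
fn encode_foo(foo: &Foo) -> [u8; 4] {
    let index: u32 = match foo {
        Foo::Bar => 0,
        Foo::Baz => 1,
        Foo::NewVariant => 2,
    };
    index.to_le_bytes()
}

/// Newer reader: indices 0 and 1 keep their old meaning, 2 is new.
fn decode_foo(bytes: [u8; 4]) -> Option<Foo> {
    match u32::from_le_bytes(bytes) {
        0 => Some(Foo::Bar),
        1 => Some(Foo::Baz),
        2 => Some(Foo::NewVariant),
        _ => None, // unknown index, e.g. data from an even newer version
    }
}
```

Data written by the old version only ever contains the indices 0 and 1, which decode to the same variants as before. The reverse direction is not covered: an old reader that encounters index 2 has no variant to map it to.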

Unfortunately, the way bincode works, it sounds like your situation will be a little harder. I'd suggest looking at the entry object that you're serializing and figuring out a value that makes no sense. E.g. if your first field is a string, bincode will serialize its length as a usize. For the new format you can start by serializing usize::MAX, as this shouldn't be a valid length for your string in the older version.
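On the reading side of the newer version, the sentinel check could look roughly like this. It is a std-only sketch that assumes the length prefix is a little-endian u64 (my understanding of the bincode 1.x defaults); `SENTINEL`, `Layout` and `classify` are hypothetical names.

```rust
/// Impossible string length used as a marker for the new layout.
const SENTINEL: u64 = u64::MAX;

#[derive(Debug, PartialEq)]
enum Layout<'a> {
    /// First 8 bytes are a plausible length: decode the whole input with the old types.
    Old(&'a [u8]),
    /// First 8 bytes are the sentinel: the rest is in the new layout.
    New(&'a [u8]),
}

fn classify(bytes: &[u8]) -> Option<Layout<'_>> {
    if bytes.len() < 8 {
        return None; // too short to contain even a length prefix
    }
    let mut first = [0u8; 8];
    first.copy_from_slice(&bytes[..8]);
    if u64::from_le_bytes(first) == SENTINEL {
        Some(Layout::New(&bytes[8..]))
    } else {
        Some(Layout::Old(bytes))
    }
}
```

This only lets the newer reader tell the two layouts apart; it does nothing for readers that predate the sentinel.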

We have a mostly complete docs/spec.md that may also give you more ideas on how the bincode format works.

Hope this helps

Answer 5 · 2023-05-27T10:18:46.000Z

Unfortunately, the way bincode works, it sounds like your situation will be a little harder. I'd suggest looking at the entry object that you're serializing and figuring out a value that makes no sense. E.g. if your first field is a string, bincode will serialize its length as a usize. For the new format you can start by serializing usize::MAX, as this shouldn't be a valid length for your string in the older version.

I fear that in the case described above, which uses bincode 1.3.3, this would lead to crashes, as the deserializer would attempt to allocate that string?

Answer 6 · 2023-05-27T12:25:14.000Z

If you try to deserialize a 3.1 file with a 3.0 application, then yes, it would try to allocate that string.
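For illustration, this is the kind of check a hand-written reader can make before allocating. It cannot help a 3.0 binary that is already deployed, and it is not how bincode's own deserializer is implemented; `read_string` is a hypothetical helper that assumes a little-endian u64 length prefix.

```rust
/// Reads a length-prefixed UTF-8 string and returns it with the remaining bytes.
fn read_string(bytes: &[u8]) -> Result<(String, &[u8]), &'static str> {
    if bytes.len() < 8 {
        return Err("truncated length prefix");
    }
    let (prefix, rest) = bytes.split_at(8);
    let mut raw = [0u8; 8];
    raw.copy_from_slice(prefix);
    let declared = u64::from_le_bytes(raw);

    // Reject the length before allocating: a string can never be longer
    // than the bytes that are actually left in the input.
    if declared > rest.len() as u64 {
        return Err("declared length exceeds input");
    }
    let (body, tail) = rest.split_at(declared as usize);
    let text = String::from_utf8(body.to_vec()).map_err(|_| "invalid UTF-8")?;
    Ok((text, tail))
}
```

With this check, a prefix of u64::MAX produces an error instead of an allocation attempt.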

Answer 7 · 2023-07-26T21:55:41.000Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.