tikv/sig-transaction

How to change write CF?

Closed this issue · 7 comments

nrc commented

Constraints:

  • backwards compatibility including rolling updates
  • distinguish between a rollback and a write when the rollback's start_ts is the same as the write's commit_ts

nrc commented

I think that rather than changing the write CF, we can change the timestamps. Write CF keys have the format encode(key), ts, and encode(key) is terminated by a recognizable sequence of bytes, so we can make ts variable length. If ts is 8 bytes, we treat it as an 'old' timestamp, and if it's longer, we treat it as a new timestamp. So backward compatibility is possible. The hard case is a cluster with mixed nodes: if a new node writes a new timestamp and an old node reads it, we might not do the right thing. We can probably guard against this by only using new timestamps once we know that all nodes have been upgraded.
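As a rough sketch of the length check described above, the following Python models a write CF key as an encoded user key followed by a timestamp suffix. It assumes the user key uses a memcomparable group encoding (8 payload bytes plus one marker byte per group, where any marker other than 0xFF ends the key). The names encode_bytes, encoded_key_len and split_write_key are made up for illustration and are not TiKV functions.

```python
import struct

GROUP = 8       # payload bytes per group (assumed encoding)
MARKER = 0xFF   # marker byte meaning "full group, more data follows"


def encode_bytes(raw: bytes) -> bytes:
    """Encode a user key in 9-byte groups; the last group is zero-padded."""
    out = bytearray()
    for i in range(0, len(raw) + 1, GROUP):
        chunk = raw[i:i + GROUP]
        pad = GROUP - len(chunk)
        out += chunk + b"\x00" * pad
        out.append(MARKER - pad)  # 0xFF for a full group, smaller at the end
    return bytes(out)


def encoded_key_len(buf: bytes) -> int:
    """Scan groups until the terminating marker; return the encoded length."""
    pos = 0
    while pos + GROUP < len(buf):
        marker = buf[pos + GROUP]
        pos += GROUP + 1
        if marker != MARKER:
            return pos
    raise ValueError("encoded key is truncated")


def split_write_key(buf: bytes):
    """Return (encoded_key, kind, ts_bytes), where kind is 'old' or 'new'."""
    n = encoded_key_len(buf)
    suffix = buf[n:]
    if len(suffix) == 8:
        return buf[:n], "old", suffix
    if len(suffix) > 8:
        return buf[:n], "new", suffix
    raise ValueError("timestamp suffix shorter than 8 bytes")


key = encode_bytes(b"k1")
old_key = key + struct.pack(">Q", 5)   # 8-byte timestamp
new_key = key + bytes(16)              # hypothetical 16-byte timestamp
print(split_write_key(old_key)[1], split_write_key(new_key)[1])
```

Because the end of the encoded key can be found without looking at the suffix, the reader can classify the timestamp purely by how many bytes remain.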

As for the new timestamp format, I have an idea where the first 8 bytes are a local ts (more on what that means later), followed by a 7-byte node id (0 for PD-generated timestamps) and 1 byte for a spec number, for future compatibility. Node ids are assigned by PD when a node starts up (we probably already have a node id somewhere).
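To make the proposed layout concrete, here is a minimal Python sketch of a 16-byte timestamp: 8 bytes of local ts, 7 bytes of node id, 1 byte of spec number. The big-endian byte order and the names pack_ts and unpack_ts are assumptions for illustration; the proposal itself only fixes the field widths.

```python
def pack_ts(local_ts: int, node_id: int = 0, spec: int = 1) -> bytes:
    """Pack a timestamp as local ts (8 bytes) | node id (7 bytes) | spec (1 byte).

    node_id = 0 stands for a PD-generated timestamp, as in the proposal.
    int.to_bytes raises OverflowError if a field does not fit its width.
    """
    return (
        local_ts.to_bytes(8, "big")
        + node_id.to_bytes(7, "big")
        + spec.to_bytes(1, "big")
    )


def unpack_ts(buf: bytes):
    """Inverse of pack_ts; returns (local_ts, node_id, spec)."""
    if len(buf) != 16:
        raise ValueError("expected a 16-byte timestamp")
    local_ts = int.from_bytes(buf[:8], "big")
    node_id = int.from_bytes(buf[8:15], "big")
    spec = buf[15]
    return local_ts, node_id, spec


ts = pack_ts(local_ts=7, node_id=3)
print(len(ts), unpack_ts(ts))
```

With big-endian fields, comparing the packed bytes compares the local ts first and only then the node id, which matches the idea that the upper 8 bytes carry the time ordering.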

Timestamps would still be globally unique, although only the upper 8 bytes are monotonic with time. So there is a monotonic partial order and a non-monotonic total order.
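The two orders can be illustrated with a small Python sketch. It assumes a timestamp is the pair (local_ts, node_id); the names Ts, happens_before and total_key are hypothetical. Only local_ts says anything about time, while the full pair merely breaks ties so that every timestamp is unique and sortable.

```python
from typing import NamedTuple


class Ts(NamedTuple):
    local_ts: int  # upper 8 bytes: monotonic with time
    node_id: int   # lower bytes: only disambiguates, carries no time meaning


def happens_before(a: Ts, b: Ts) -> bool:
    """Partial order: only the local ts says anything about time."""
    return a.local_ts < b.local_ts


def total_key(ts: Ts):
    """Total order: compare the whole timestamp, node id as tie-breaker."""
    return (ts.local_ts, ts.node_id)


a, b, c = Ts(10, 1), Ts(10, 2), Ts(11, 1)
# a and b are unordered in time, yet still sort deterministically.
print(happens_before(a, b), happens_before(b, a), sorted([c, b, a], key=total_key))
```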

@sticnarf suggested in Slack an 8 byte PD timestamp and an 8 byte local version number.

backwards compatibility including rolling updates

I don't think it's a problem, because the order of rolling updates is PD, then TiKV, and then TiDB. A timestamp conflict can only occur once TiDB has been upgraded.

nrc commented

I don't think it's a problem, because the order of rolling updates is PD, then TiKV, and then TiDB. A timestamp conflict can only occur once TiDB has been upgraded.

Nice.

nrc commented

My preference is to start by trying solution 1, write priority. I believe it is sound, and it is certainly the easiest to implement. It might affect performance slightly, but let's benchmark to find out.

Sorry, I overlooked this issue when I was working on tikv/tikv#8349, but I had thought about this solution before. In my opinion, changing the key format in the write CF will be harder than we expect. Code that depends on the current key format is everywhere. Even TiFlash and CDC would be affected and would need changes to follow it. Also, TiKV uses a prefix extractor for RocksDB to build the bloom filter, and it simply truncates the last 8 bytes. If the timestamp part can be either 8 or 16 bytes, the logic of the prefix extractor becomes more complicated: it may need to decode the key part to find the actual length of the timestamp part. Without the prefix extractor, the bloom filter cannot work when looking up a specific user key.
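The prefix extractor problem can be shown with a short Python model. This is only a sketch and not the RocksDB or TiKV API: it assumes the user key is encoded in 9-byte groups (8 payload bytes plus a marker byte, where a marker other than 0xFF ends the key) and that the timestamp is stored bitwise-inverted in big-endian order. The names fixed_prefix and decoding_prefix are invented for illustration.

```python
import struct

GROUP = 8       # payload bytes per group (assumed encoding)
MARKER = 0xFF   # marker byte meaning "full group, more data follows"


def encode_bytes(raw: bytes) -> bytes:
    """Encode a user key in 9-byte groups; the last group is zero-padded."""
    out = bytearray()
    for i in range(0, len(raw) + 1, GROUP):
        chunk = raw[i:i + GROUP]
        pad = GROUP - len(chunk)
        out += chunk + b"\x00" * pad
        out.append(MARKER - pad)
    return bytes(out)


def encode_ts(ts: int) -> bytes:
    """8-byte timestamp, inverted so that newer versions sort first."""
    return struct.pack(">Q", ~ts & 0xFFFFFFFFFFFFFFFF)


def fixed_prefix(write_key: bytes) -> bytes:
    """Models today's behaviour: drop the last 8 bytes."""
    return write_key[:-8]


def decoding_prefix(write_key: bytes) -> bytes:
    """Length-agnostic alternative: walk the groups to find the key's end."""
    pos = 0
    while pos + GROUP < len(write_key):
        marker = write_key[pos + GROUP]
        pos += GROUP + 1
        if marker != MARKER:
            return write_key[:pos]
    raise ValueError("encoded key is truncated")


user = encode_bytes(b"user-key")
old_key = user + encode_ts(42)              # 8-byte timestamp
new_key = user + encode_ts(42) + bytes(8)   # hypothetical 16-byte timestamp

# Truncating 8 bytes yields different prefixes for the same user key,
# so a bloom filter built on that prefix would miss the new-format entry.
print(fixed_prefix(old_key) == fixed_prefix(new_key))
print(decoding_prefix(old_key) == decoding_prefix(new_key))
```

The decoding variant gives one prefix per user key regardless of timestamp length, but it has to scan the key on every call, which is the extra complexity mentioned above.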

Considering these difficulties, I finally decided to try the Rollback Flag solution first in tikv/tikv#8349, which has a much smaller impact.
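The general idea of a rollback flag can be sketched as follows. This is a toy Python model based on assumptions, not the code from tikv/tikv#8349: the write CF is a dict keyed by (key, ts), and WriteRecord, overlapped_rollback, write_rollback and is_rolled_back are hypothetical names. When a rollback's start_ts lands on the slot already used by a commit with the same commit_ts, the commit record is kept and only marked, so neither piece of information is lost.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class WriteRecord:
    kind: str                          # "put", "delete", "rollback", ...
    start_ts: int                      # start_ts of the transaction that wrote it
    overlapped_rollback: bool = False  # hypothetical flag, see lead-in


WriteCf = Dict[Tuple[bytes, int], WriteRecord]


def write_rollback(cf: WriteCf, key: bytes, start_ts: int) -> None:
    """Record that the transaction starting at start_ts was rolled back."""
    slot = (key, start_ts)
    existing = cf.get(slot)
    if existing is None:
        cf[slot] = WriteRecord("rollback", start_ts)
    elif existing.kind != "rollback":
        # A commit with commit_ts == start_ts already occupies the slot:
        # keep it and set the flag instead of overwriting it.
        cf[slot] = replace(existing, overlapped_rollback=True)


def is_rolled_back(cf: WriteCf, key: bytes, start_ts: int) -> bool:
    rec = cf.get((key, start_ts))
    return rec is not None and (rec.kind == "rollback" or rec.overlapped_rollback)


cf: WriteCf = {(b"k", 10): WriteRecord("put", start_ts=5)}  # committed at ts 10
write_rollback(cf, b"k", 10)   # another transaction with start_ts 10 rolls back
print(cf[(b"k", 10)].kind, is_rolled_back(cf, b"k", 10))
```

This keeps the key format unchanged, which is why it touches far less code than a variable-length timestamp would.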

nrc commented

Closing, since I think this is solved by tikv/tikv#8349.