paritytech/parity-scale-codec

Possible differences between this implementation and Polkadot's SCALE specification


As part of our efforts to rebuild the Substrate docs, I'm currently putting together a small page about this Rust implementation of the SCALE codec. During my research I noticed a few curious things. My assumption is that the Polkadot specification of SCALE [1] was used as a reference for this implementation. This might not be true, but surely some ground truth for all the different implementations of SCALE must exist somewhere.

Finding: I noticed that the compact encoding of integers $\geq 2^{30}$ appears to differ from what is defined in the specification [2]. In this implementation of SCALE, for compact encodings of integers $\geq 2^{30}$, the upper six bits of the first byte store the number of bytes that follow the first byte. More precisely, if we denote by $m$ the length of the LE encoding of the actual integer being encoded, then the upper six bits encode $m - 4$.

As an example we can look at the first relevant integer and its compact encoding:

  use parity_scale_codec::{Compact, Encode};

  let x: u32 = u32::pow(2, 30);
  println!("{:02x?}", Compact(x).encode());

The output of this is [03, 00, 00, 00, 40]. Here the upper six bits of the first byte are equal to 0, and 0x03 results from the mode flag 0b11 occupying the two least-significant bits. We have $m - 4 = 0 \iff m = 4$, which is exactly the number of bytes that follow the first byte as the LE encoding of $2^{30}$.
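
To double-check the byte layout, here is a minimal self-contained sketch; the helper big_int_mode_prefix is hypothetical, written only for this illustration, and not part of the crate:

  use parity_scale_codec::{Compact, Encode};

  // Hypothetical helper: builds the first byte of a big-integer-mode
  // compact encoding the way this implementation does it. `m` is the
  // number of LE bytes following the first byte; the upper six bits
  // store `m - 4` and the two low bits are the mode flag 0b11.
  fn big_int_mode_prefix(m: u8) -> u8 {
      ((m - 4) << 2) | 0b11
  }

  fn main() {
      let x: u32 = u32::pow(2, 30);
      let encoded = Compact(x).encode();
      // 2^30 needs m = 4 LE payload bytes, so the first byte is
      // ((4 - 4) << 2) | 0b11 = 0x03.
      assert_eq!(encoded[0], big_int_mode_prefix(4));
      assert_eq!(encoded, vec![0x03, 0x00, 0x00, 0x00, 0x40]);
  }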

However, if I interpret the definitions of the specification correctly, it differs from this implementation in that the byte used to encode the length (i.e. the first byte) is counted as well. This can be seen from the definition of $Enc_{SC}^{Len}$ in [Def. 198]: there, the byte array for the case $2^{30} \leq n$ is indexed as $k_1 k_2 \ldots k_m$, i.e. $m$ includes the first byte.
That $k_1$ really is the length-encoding byte follows from the requirement that
$$k_1^{7} \cdots k_1^{3} k_1^{2} = m - 4.$$

Coming back to the example above, the output mandated by the specification would be [07, 00, 00, 00, 40]: counting the first byte gives $m = 5$, so the upper six bits store $m - 4 = 1$ and the first byte becomes 0b00000111 = 0x07.
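
To make the disagreement concrete, the following sketch computes the first byte under both readings (the prefix helper is hypothetical, written only for this comparison):

  // The first byte is always `(six_bits << 2) | 0b11`; the two readings
  // only disagree on what the six bits count.
  fn prefix(six_bits: u8) -> u8 {
      (six_bits << 2) | 0b11
  }

  fn main() {
      // Encoding 2^30: four LE payload bytes, five bytes in total.
      assert_eq!(prefix(4 - 4), 0x03); // this implementation: payload bytes minus 4
      assert_eq!(prefix(5 - 4), 0x07); // the spec: total bytes minus 4
  }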

Now, this seems like a minor difference, and in fact I think the implementation here is the better choice [3], but that's beside the point. I'm just wondering whether this could cause any harm, especially if other implementations of SCALE encode exactly as defined in the specification. I'm currently not knowledgeable enough to gauge the implications, so I wanted to share what I found.

Footnotes

  1. https://spec.polkadot.network/id-cryptography-encoding#sect-scale-codec

  2. There exists a similar issue with vector encodings, but since it is a bit more complicated, I will refrain from elaborating here until some sort of relevance is established.

  3. My understanding is that the available six bits are used to encode $m - 4$ (instead of e.g. $m$) to ensure they are utilized as efficiently as possible. Not counting the first byte allows us to compact encode integers up to $2^{(63+4) \cdot 8} - 1 = 2^{536} - 1$ (see the sketch following these footnotes).
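
As a quick sanity check of the arithmetic in footnote 3 (a sketch of my own reasoning, under the two readings discussed above):

  fn main() {
      // Six prefix bits hold values 0..=63 and are read as m - 4, so m <= 67.
      // Implementation reading: m counts only the payload bytes, so the payload
      // can span 67 bytes and the largest encodable integer is 2^536 - 1.
      assert_eq!((63 + 4) * 8, 536);
      // Spec reading: m also counts the prefix byte, leaving 66 payload bytes,
      // i.e. a maximum of 2^528 - 1.
      assert_eq!((63 + 4 - 1) * 8, 528);
  }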

bkchr commented

The spec isn't correct; there is no need to include the one byte that is always present in the encoding.

> The spec isn't correct; there is no need to include the one byte that is always present in the encoding.

Thanks for your reply. This part I do understand. I just wondered if this can result in problems when this implementation is used in conjunction with others. Or did this Rust implementation itself serve as some sort of reference implementation for SCALE?

bkchr commented

The Rust implementation is the reference. The spec didn't exist before, at least not the one you mentioned.

Thank you!