Inconsistent alphabets across languages
kschulst opened this issue · 7 comments
Hi! It seems that the default base62 alphabet is defined differently across languages:
Java implementation:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Ref: https://github.com/mysto/java-fpe/blob/main/src/main/java/com/privacylogistics/FF3Cipher.java#L515
Python implementation:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
Ref: https://github.com/mysto/python-fpe/blob/main/ff3/ff3.py#L75
This will lead to inconsistent results between implementations if the user does not define the alphabet explicitly.
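To illustrate the mismatch without calling either library, here is a minimal sketch. The helper `to_number` is hypothetical (it is not part of java-fpe or python-fpe); it only interprets a string as a base-62 number under a given alphabet, which is the step where the two defaults diverge.

```python
# Default base62 alphabets as quoted above.
JAVA_DEFAULT = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
PYTHON_DEFAULT = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_number(text: str, alphabet: str) -> int:
    """Hypothetical helper: read `text` as a number in base len(alphabet)."""
    value = 0
    for ch in text:
        value = value * len(alphabet) + alphabet.index(ch)
    return value


# Digits map to the same values under both alphabets; letters do not.
same = to_number("42", JAVA_DEFAULT) == to_number("42", PYTHON_DEFAULT)
java_value = to_number("aB", JAVA_DEFAULT)
python_value = to_number("aB", PYTHON_DEFAULT)
print(same, java_value, python_value)
```

Since the cipher operates on these numeric values, the same key, tweak and plaintext will generally produce different ciphertext strings as soon as a letter is involved.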
Hi Kenneth, thanks for detailing this issue with the base62 alphabet. It's unfortunate that these two implementations have reversed the order here.
The solution is probably to externalize the test vectors as YAML or JSON and share them. However, aligning the order will break previously encoded data for whichever package changes. I'll look into this, as the benefits of shared test vectors are significant (in addition to correcting this alignment issue).
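One possible shape for such a shared file, as a sketch only: the field names (`alphabet`, `plaintext` and so on) are assumptions, not an existing Mysto format. The values come from the NIST sample quoted later in this thread, and the radix-26 alphabet is assumed to be the first 26 symbols of digits+lowercase. The tweak is not shown in the thread, so it is left out here rather than guessed. Recording the alphabet in each vector means no implementation has to fall back on its own default.

```python
import json

# Hypothetical shared vector file; field names are illustrative assumptions.
VECTORS_JSON = """
[
  {
    "algorithm": "FF3-AES128",
    "key": "EF4359D8D580AA4F7F036D6F04FC6A94",
    "radix": 26,
    "alphabet": "0123456789abcdefghijklmnop",
    "plaintext": "0123456789abcdefghi",
    "ciphertext": "g2pk40i992fn20cjakb"
  }
]
"""


def load_vectors(text: str) -> list:
    """Parse vectors and check that each alphabet matches its radix."""
    vectors = json.loads(text)
    for v in vectors:
        alphabet = v["alphabet"]
        if len(alphabet) != v["radix"]:
            raise ValueError("alphabet length must equal radix")
        if len(set(alphabet)) != len(alphabet):
            raise ValueError("alphabet symbols must be unique")
    return vectors


vectors = load_vectors(VECTORS_JSON)
print(len(vectors), vectors[0]["alphabet"])
```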
If the alphabets are to be aligned across languages, what would be the preferred definition of the base62 alphabet: digits+uppercase+lowercase or digits+lowercase+uppercase?
Using the ordinal values from the ASCII table as a guide would suggest that digits+uppercase+lowercase could be the reference definition. On the other hand, if other implementations already use alphabets such as 0123456789abcdef, the digits+lowercase+uppercase option could be regarded as an extension of such alphabets.
It would be nice to know, so that my implementation can settle on a good "default override" alphabet.
Yes, ASCII would suggest digits+uppercase+lowercase.
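This is easy to confirm with the standard library alone. The snippet below is a small check, not code from either Mysto package: sorting the 62 symbols by code point puts digits first, then uppercase, then lowercase.

```python
import string

symbols = string.digits + string.ascii_lowercase + string.ascii_uppercase

# sorted() on single characters orders them by Unicode code point,
# which coincides with ASCII for these 62 symbols.
by_code_point = "".join(sorted(symbols))
print(by_code_point)

# Code points at the boundaries between the three groups.
boundaries = (ord("9"), ord("A"), ord("Z"), ord("a"))
print(boundaries)
```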
However, the NIST test vectors for FPE use digits+lowercase and do not use uppercase in their plaintext / ciphertext; see Sample #5 below, for example. Note that this affects only the test cases; the alphabet ordering is not part of the standard.
FF3-AES128
Key is EF 43 59 D8 D5 80 AA 4F 7F 03 6D 6F 04 FC 6A 94
Radix = 26
--------------------------------------------------------------
PT is <0123456789abcdefghi>
CT is g2pk40i992fn20cjakb
---------------------------
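A quick way to see why this vector ties the test cases to digits+lowercase is to take the first 26 symbols of each candidate default alphabet and check whether the sample's plaintext and ciphertext fit. This is an illustrative check only; taking the first `radix` symbols of the default alphabet is an assumption about how a radix-only constructor picks its symbols.

```python
import string

DIGITS_LOWER_UPPER = string.digits + string.ascii_lowercase + string.ascii_uppercase
DIGITS_UPPER_LOWER = string.digits + string.ascii_uppercase + string.ascii_lowercase

# Plaintext and ciphertext from the NIST sample above.
PT = "0123456789abcdefghi"
CT = "g2pk40i992fn20cjakb"
RADIX = 26


def fits(text: str, alphabet: str, radix: int) -> bool:
    """True if every symbol of `text` is among the first `radix` symbols."""
    return set(text) <= set(alphabet[:radix])


lower_first = fits(PT, DIGITS_LOWER_UPPER, RADIX) and fits(CT, DIGITS_LOWER_UPPER, RADIX)
upper_first = fits(PT, DIGITS_UPPER_LOWER, RADIX) or fits(CT, DIGITS_UPPER_LOWER, RADIX)
print(lower_first, upper_first)
```

Under the digits+uppercase ordering, a radix-26 alphabet contains no lowercase letters at all, so the sample cannot even be expressed without a custom alphabet.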
Implementations that support custom alphabets could handle this as a special case, but not all of the Mysto languages support custom alphabets at this point. My C implementation, for example, does not support it currently.
I may ask NIST why they chose digits+lowercase, instead of using the ASCII ordering. I'd say ASCII is the more traditional ordering.
Based on the NIST test vectors, and for compatibility with the other Mysto FPE implementations, it seems that the safest bet is to go with a digits+lowercase+uppercase alphabet as the default then?
To give you some context, we are working on a custom extension to the Google Tink library. Tink does not currently provide any cryptographic primitives for FPE, so we have created a custom FPE abstraction. Instead of implementing FPE from scratch, we are using the Mysto library family under the hood as a baseline implementation for FF3-1 across languages. Disclaimer: I am not associated with Google or Tink.
The alphabet of choice (such as ALPHANUMERIC) is encoded into the Tink FPE key material, so it would be nice to be aligned with the defaults applied by the Mysto libraries.
(It would be very interesting to discuss other usage aspects of the Mysto library and get your opinion on the design decisions that we have made for Tink FPE (Python and Java versions). But that is probably out of scope for this issue.)
Kenneth,
After further thought, both Unicode and ASCII use the ordinal sort order of digits+uppercase+lowercase. It's sort of an accident of how NIST has defined the test vectors, which is unfortunate, but can be worked around.
To correct this in the Java version, I'll remove the support for radix 26 in the FF3Cipher(key, tweak, radix) constructor.
@kschulst please look at the latest trunk for java-fpe and see if this looks good to you. I plan to revise the Python and other implementations to use the general lexicographical ordering of digits, uppercase and then lowercase.
Totally agree with this approach