Investigate use of NFKD
steveatinfincia opened this issue · 6 comments
BIP-0039 suggests it needs to be applied in two situations:
When generating the wordlists
The standard says this:
The wordlist can contain native characters, but they must be encoded in UTF-8 using Normalization Form Compatibility Decomposition (NFKD).
This should be taken care of because the wordlist in bip39-rs is from the BIP-0039 repo and has already been processed correctly.
When turning a mnemonic phrase into a seed
The standard says this:
To create a binary seed from the mnemonic, we use the PBKDF2 function with a mnemonic sentence (in UTF-8 NFKD) used as the password and the string "mnemonic" + passphrase (again in UTF-8 NFKD) used as the salt. The iteration count is set to 2048 and HMAC-SHA512 is used as the pseudo-random function. The length of the derived key is 512 bits (= 64 bytes).
We currently make no attempt to follow this and should.
I believe the unicode-normalization crate provides this as UnicodeNormalization::nfkd.
I've been working on adding NFKD normalization and need reliable test vectors in non-English languages. (I already have a Japanese set.)
I found some in the NBitcoin project: https://github.com/MetacoSA/NBitcoin/tree/master/NBitcoin.Tests/data
Nice find @wigy-opensource-developer!
The tests there were generated with https://github.com/nym-zone/easyseed
Maybe this could be an interesting code fix: non-normalized input for Japanese phrases to test normalization: bip32JP/bip32JP.github.io@360c05a (I do not speak Japanese, so I would need to rely on these to make test vectors myself.)
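To show what a non-normalized input looks like at the byte level, here is a small std-only sketch. The helper names are hypothetical and the strings are illustrative only; they are not taken from the linked commit or from the BIP-0039 Japanese wordlist. It covers two differences that NFKD is expected to remove: the ideographic space (U+3000) versus the ASCII space, and a precomposed kana versus its base character plus combining mark.

```rust
/// Hypothetical helper: joins words with the given separator to build one
/// candidate input string.
fn join_with(words: &[&str], separator: &str) -> String {
    words.join(separator)
}

/// Hypothetical helper: returns the same phrase twice, first joined with the
/// ideographic space (U+3000), then with the ASCII space (U+0020).
fn separator_variants(words: &[&str]) -> (String, String) {
    (join_with(words, "\u{3000}"), join_with(words, " "))
}

/// The kana "ga" as a single precomposed code point (U+304C).
const GA_PRECOMPOSED: &str = "\u{304C}";
/// The same kana as "ka" (U+304B) followed by the combining voiced sound
/// mark (U+3099).
const GA_DECOMPOSED: &str = "\u{304B}\u{3099}";

/// True when two inputs are different UTF-8 byte strings, which is what a
/// key derivation function would see if no normalization were applied.
fn bytes_differ(a: &str, b: &str) -> bool {
    a.as_bytes() != b.as_bytes()
}
```

A normalization test could feed both members of each pair through the seed derivation and check that the resulting seeds match. Without NFKD they would not, because `bytes_differ` is true for each pair.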