Incorrect test data

Question

Incorrect test data

realindiahotel opened this issue 10 years ago · 6 comments

Hello, @bip32JP you don't seem to respond to my comments on the closed pull request.

The test data you have listed is wrong. As we are building unit tests to test against the test data you supply, it is easy to see that your test data is invalid because it contains ideographic spaces.

As you can see in the BIP39 spec, the mnemonic sentence generated will always contain words separated by an ASCII space. I realise you are saying that people need to normalise the test data before using it but this is incorrect practice. Test data should ALWAYS be an exact match of what you expect the code to produce. If you leave that test data there with ideographic spaces and someone fails to normalise the test data and gets the code working to your test data that would be very bad.

Long story short, test data to be correct to the BIP39 spec can NEVER contain ideographic spaces as the code normalises all output. Therefore correct code will always fail your test data.

If you want to refute my claim, can you please show me the code in the reference client that outputs a mnemonic with ideographic spaces? Else please explain how it is possible for the reference code to produce data that matches your test output.

Answer 1 · 2014-12-20T10:24:49.000Z

If you run all vectors before comparing to mine, you will see that only the phrase should fail.

This is as expected, as my website, as well as the suggestion on the special exceptions, says to use ideographic spaces.

I have these in my vectors so that Japanese developers will notice this, and implement the recommendation.

My wordlist is in a phonetic character set in Japanese, so if you place them too close to one another, it is hard to tell where one word ends and another begins.

Seeing as these characters are wider than ASCII characters, using an ASCII space to separate them will have a high chance of confusing the user.

I HIGHLY recommend displaying, and generating, Japanese phrases using ideographic spaces.

However, as it is not required (seeing as ideographic spaces normalize into ASCII spaces, so either way, the same seed will be generated) I will not say "it is required" but I STRONGLY RECOMMEND it.

This will not be fixed, as I have purposefully used ideographic spaces for these reasons.

Answer 2 · 2014-12-20T11:18:53.000Z

You are correct, it is only the phrase that fails as the bytes and everything are accurate.

I don't think so, wouldn't Japanese developers see the wider spaces as normal? Given the specification is written in English, wouldn't it be fair to assume majority of developers will not be using input that uses ideographic spaces? You are deliberately placing incorrect test data in the hopes that a Japanese developer might notice the wider spaces? What your website says is irrelevant, I didn't even know there was a web site as a developer I came here following the spec documentation.

I'm going to have to release my own version of these tests that actually pass the mnemonic tests for anyone who actually wants to develop and doesn't want to have to worry about modifying test data as that is one of the biggest no nos out!

I'll call it Japanese test vectors with working mnemonic tests for non japanese developers or something.

Answer 3 · 2014-12-20T11:32:07.000Z

wouldn't it be fair to assume majority of developers will not be using input that uses ideographic spaces?

If you are coding something that generates a Japanese phrase (which you said you are doing)

USE IDEOGRAPHIC SPACES.

You don't need an input with the ideographic space. use unicode directly. Almost every single programming language has a method of representing a unicode / UTF-8 character without actually having to input it. u"\u3000" in python, for example.

The dangers of using ASCII space to separate the words is if the user gets two words that, when put together can make a bigger, maybe more common word. If they view that as one word, it will cause problems when restoring.

I will make a pull request to bips to remove the "suggestion" and make it a "requirement" if you want to push the issue further by making your own vectors.

Answer 4 · 2014-12-20T11:46:59.000Z

But the specification DISSALLOWS the use of ideographic spaces!!!! You are talking about when designing a UI. YES by all means when DISPLAYING the mnemonic, use u3000, this is exactly what I do! HOWEVER, the OUTPUT from the code developed to the spec, outputs normal space and so the test data needs to be comparable to what the specification is being output. It is up to the developer to swap u3000 from input and replace it on output, I remove u3000 by normalisation as per spec. You have provided test data that will fail on the output from the reference client....you can make it a requirement or what ever you like but the fact remains that your test data FAILS on the reference code.....do you really think it is ok to provide data test data that doesn't even pass when compared to the output from reference code? I should think not! Make it a requirement do what you like but if you don't fix it I AM FORCED to supply test data that PASES against the reference code without requiring manipulation first.

Answer 5 · 2014-12-20T14:13:32.000Z

If you feel this way, remove my test vectors from the bitcoin/bips in a pull request, and let consensus decide.

I will argue against it there, though.

I am telling you that the Japanese wordlist can not be generated properly using the reference code as is.

If you think that means Japanese wordlist should be removed, then make a pull request and see what consensus everyone gets. Making a test vector for Japanese that works with an incorrect implementation of Japanese phrase generation will generate incorrect Japanese phrases.

Answer 6 · 2014-12-20T14:26:29.000Z

I don't want to see Japanese removed at all, the more languages the better obviously!

I know what you are saying, but what I am saying is the wordlist can be generated properly, just not in a form that is ideal for the UI. Unless the reference code is changed, it will always generate a mnemonic containing words separated by ASCII space. Now I realise your point that this is no good for Japanese UI, but it does not change the output from the code!

If you want to force ideographical spaces out for Japanese, you'd be better off changing the reference code such that it does this, but it is never ok to make a point through test data like this. The spec says that it has to provide a mnemonic separated by ASCII space, not suggested, it HAS to!!!

And if a Japanese mnemonic is supplied with ASCII spaces it will not result in lost funds because the ideographic should be normalised to ASCII spaces anyway so it makes no difference as long as normalisation is occurring as per the spec.

I don't think anything needs to be removed etc. I just think that test data has to be true to the specification and not mangled to suit a language/UI. If there needs to be changes to the spec to cater for Japanese language, change the spec and reference code, that way the code will then pass the tests and this is how it should be, test data ALWAYS matches expected output. If we go down the path of creating test data where people have to do things to get around the specification then what is the point of the specification?