se-edu/addressbook-level2

Names with unicode characters are treated as invalid

Neurrone opened this issue · 10 comments

Noticed this happening in the last 2 test cases in input.txt. Think it might be caused by a letter being represented with 2 code points, in which case \p{L} wouldn't match.

I can reproduce on Windows only. runtests.bat fails for me.

Comparing files actual.txt and EXPECTED.TXT
***** actual.txt
|| Enter command: || [Command entered:  add Björn Borg p/98765432 e/borg@gmail.com a/311, Clementi Ave 2, #02-25]
|| Person names should be spaces or alphabetic characters
|| ===================================================
***** EXPECTED.TXT
|| Enter command: || [Command entered:  add Björn Borg p/98765432 e/borg@gmail.com a/311, Clementi Ave 2, #02-25]
|| New person added: Björn Borg Phone: 98765432 Email: borg@gmail.com Address: 311, Clementi Ave 2, #02-25 Tags: 
|| ===================================================
*****

***** actual.txt
||      3. John-Doe jr. Phone: 98765432 Email: johnd@gmail.com Address: 311, Clementi Ave 2, #02-25 Tags: 
|| 
|| 3 persons listed!
|| ===================================================
***** EXPECTED.TXT
||      3. John-Doe jr. Phone: 98765432 Email: johnd@gmail.com Address: 311, Clementi Ave 2, #02-25 Tags: 
||      4. Björn Borg Phone: 98765432 Email: borg@gmail.com Address: 311, Clementi Ave 2, #02-25 Tags: 
|| 
|| 4 persons listed!
|| ===================================================
*****

***** actual.txt
|| Enter command: || [Command entered:  add José Eduardo p/98765432 e/borg@gmail.com a/311, Clementi Ave 2, #02-25]
|| Person names should be spaces or alphabetic characters
|| ===================================================
***** EXPECTED.TXT
|| Enter command: || [Command entered:  add José Eduardo p/98765432 e/borg@gmail.com a/311, Clementi Ave 2, #02-25]
|| New person added: José Eduardo Phone: 98765432 Email: borg@gmail.com Address: 311, Clementi Ave 2, #02-25 Tags: 
|| ===================================================
*****

***** actual.txt
||      3. John-Doe jr. Phone: 98765432 Email: johnd@gmail.com Address: 311, Clementi Ave 2, #02-25 Tags: 
|| 
|| 3 persons listed!
|| ===================================================
***** EXPECTED.TXT
||      3. John-Doe jr. Phone: 98765432 Email: johnd@gmail.com Address: 311, Clementi Ave 2, #02-25 Tags: 
||      4. Björn Borg Phone: 98765432 Email: borg@gmail.com Address: 311, Clementi Ave 2, #02-25 Tags: 
||      5. José Eduardo Phone: 98765432 Email: borg@gmail.com Address: 311, Clementi Ave 2, #02-25 Tags: 
|| 
|| 5 persons listed!
|| ===================================================
*****

I think the main problem is that FC does not support UTF-8. With the /U switch, it does support UCS2-LE-BOM, but Java writes UTF-8 to stdout for some reason.

Without the /U switch, it treats the files as ASCII, hence these comparison differences.
edit: totally wrong, sorry too sleepy. time to hit the bed.

Okay, here's a fix:

diff --git a/test/runtests.bat b/test/runtests.bat
index 067981b..6e243e8 100644
--- a/test/runtests.bat
+++ b/test/runtests.bat
@@ -10,7 +10,7 @@ REM compile the code into the bin folder
 javac  -cp ..\src -Xlint:none -d ..\bin ..\src\seedu\addressbook\Main.java

 REM run the program, feed commands from input.txt file and redirect the output to the actual.txt
-java -classpath ..\bin seedu.addressbook.Main < input.txt > actual.txt
+java -Dfile.encoding=UTF-8 -classpath ..\bin seedu.addressbook.Main < input.txt > actual.txt

 REM compare the output to the expected output
 FC actual.txt expected.txt
\ No newline at end of file

More a workaround though, since file.encoding is an implementation detail of Oracle Java.

References:

Nice 👍 @pyokagan can you go ahead and submit the fix? Perhaps with an additional comment in the batch file to explain why the -D switch is used.

@damithc

can you go ahead and submit the fix? Perhaps with an additional comment in the batch file to explain why the -D switch is used.

Oh whoops, instead of "here's a fix", I should have said "here is something that just stops runtests.bat from complaining -- it may not actually be a correct solution".

I currently only have a hazy idea of the problem, but it likely has something to do with how we feed Java on Windows UTF-8 input, and Java decoding stdin as some kind of single-byte encoding (maybe latin1), thus decoding them to individual separate unicode codepoints, then somehow encoding them back when printing to stdout such that the output is valid UTF-8 as well.

FC probably does the same thing in ASCII mode, which is why it also just happens to work.

The -Dfile.encoding=UTF-8 solution seems that it would work nicely for Java as long as students are running Oracle's Java (I'm having a hard time thinking of any other viable installation), but I'm still slightly concerned about FC -- especially how it handles non-ASCII bytes.

Maybe this is a "let's try this solution out and we'll check back again when it breaks" kind of scenario.

Understood @pyokagan
Yes, it's fine to have a stop-gap measure until we find a better solution.

Do we restrict the Name of the Person, i.e. only English accepted but no other language characters ?

Do we restrict the Name of the Person, i.e. only English accepted but no other language characters ?

@pyokagan what do you think?

Do we restrict the Name of the Person, i.e. only English accepted but no other language characters ?

Yeah, I think we should just do that (for both the implementation and the tests).

Yeah, I think we should just do that (for both the implementation and the tests).

We will just need to revert #43 in that case. (cc @okkhoy @ndt93 )