jagobagascon/Natural-Sorting-for-Java

Inconsistent sorting behavior on non-ASCII digits

saharNooby opened this issue · 5 comments

Consider following test case:

@Test
public void testNonAsciiDigits() {
	String fullWidthDigitOne = "1";
	String fullWidthDigitFive = "5";
	String fullWidthDigitSix = "6";

	testSort(Arrays.asList(fullWidthDigitOne, "2", fullWidthDigitFive, "9",
			fullWidthDigitOne + fullWidthDigitFive, "1" + fullWidthDigitSix, fullWidthDigitOne + "7"));
}

It currently fails:

Actual list:      [2, 9, 1, 5, 16, 17, 15]
Expected list:    [1, 2, 5, 9, 15, 16, 17]

Current code detects full width digits as decimal digits (Character.isDigit returns true for them), but then fails to compare digit sequences as numbers because it compares digits by their char code, not value.

There are two ways to fix this:

  1. Make digit comparison work using Character.digit. That way numbers containing non-ASCII digits will be compared logically. This will slightly impact performance.
int d1 = Character.digit(c1, 10);
int d2 = Character.digit(c2, 10);
if (d1 != d2) {
	res = d1 - d2;
}
  1. Replace Character.isDigit check with simple c >= '0' && c <= '9'. That way non-ASCII digits will be treated as any other non-digit character, and they won't be compared as numbers. People using non-ASCII digits may be confused by this behavior.

I personally would prefer first solution, since Unicode digits are valid digits.

After decision is made, I can do a pull request.

Reference: list of non-ASCII digits in Unicode: https://www.fileformat.info/info/unicode/category/Nd/list.htm

As as fun side note, WinAPI's StrCmpLogicalEx completely messes up the list:
[1, 16, 17, 2, 5, 9, 15]

It is true that the algorithm is failing for characters other than the basic latin digits, and that's because the comparator uses the 0 character as a filler:

So a simple test like this is not working right now.

// Full Width
testSort(Arrays.asList("0", "1", "01", "001", "10", "010", "0010", "00010"));
// Thai
testSort(Arrays.asList("๐", "๑", "๐๑", "๐๐๑", "๑๐", "๐๑๐", "๐๐๑๐", "๐๐๐๑๐"));

This library was never meant to support digits other than the latin 0-9. But I'm open to include the fix number 1 to increase the supported languages.

That would allow mixing the digits from several languages, which I don't really like, if instead of full width digits we use the thai digits we would get something like this:

Actual list:      [2, 9, ๑, ๕, 1๖, ๑7, ๑๕]
Expected list:    [๑, 2, ๕, 9, ๑๕, 1๖, ๑7]

The expected list makes no sense at all.

The library was intended to sort strings in a more human friendly way, and I don't think there's anything friendly about having somenthing like that.

I don't think there's anything friendly about having somenthing like that

There are 2 possible situations after fix 1 implementation:

  1. non-English user sorts something containing numbers in their native language and sees expected result.
  2. someone mixes digits from different languages and sets, sorts these strings and gets nonsense.

I think benefit from (1) is greater than harm from (2), especially considering probability of both cases: there defenitly will be users who use non-ASCII digits from their language, but I feel that no one would mix digits to achieve something useful (not by accident).

Yes, I absolutely agree.

I just wanted to make clear that whatever change is made to the library is for non latin alphabets (1), and that mixed digits (2) sorting behaviour would just be a collateral consequence of that.

I think it would be useful to add a disclaimer to README, something like "it works for non-ASCII digits, but treats numbers consisting of mixed digits as a whole".

Can I start implementing the fix then?

Yes, that would be helpful! Thanks.