Anagram exercise: unicode and case sensitivity

Question

tadas-s opened this issue 3 years ago · 3 comments

Hello,

Not sure if this is the right place to ask.

I'm a little confused about this test case:

static void test_unicode_anagrams(void)
{
   TEST_IGNORE();               // This is an extra credit test.  Delete this line to accept the challenge
   // These words don't make sense, they're just greek letters cobbled together.
   char inputs[][MAX_STR_LEN] = {
      "ΒΓΑ",
      "ΒΓΔ",
      "γβα"
   };

   char subject[] = { "ΑΒΓ" };

   candidates = build_candidates(*inputs, sizeof(inputs) / MAX_STR_LEN);
   enum anagram_status expected[] = { IS_ANAGRAM, NOT_ANAGRAM, NOT_ANAGRAM };

   find_anagrams(subject, &candidates);
   assert_correct_anagrams(&candidates, expected);
}

The third candidate, "γβα", is not an anagram of "ΑΒΓ" according to the test suite. But if I uppercase the candidate, I get "ΓΒΑ", which is one. It's also not a case of visually similar characters; try this in your browser console:

> "αβγ".toUpperCase() == "ΑΒΓ"
> true

According to some other tests, the anagram code should ignore letter case.

Am I missing something? Or is a UTF-8-capable solution not expected to be case-insensitive?

Cheers,
Tadas

Answer 1 · 2021-07-04T17:37:09.000Z

Hmm, that test case was added with the original commit, I think before anybody currently active joined - b9d352f

It's not listed as a test at all in the problem-specifications, so this is homegrown and probably unique to this track: https://github.com/exercism/problem-specifications/blob/main/exercises/anagram/canonical-data.json

That is probably an oversight. In fact, the example code doesn't handle this correctly at all, since it looks at the individual chars rather than the decoded Unicode code points. It's effectively finding anagrams of the individual bytes in the string, which I highly doubt is valid.
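
To illustrate the byte-versus-code-point distinction, here is a minimal sketch in C. It is not the track's example solution: utf8_next, bytes_are_anagram and codepoints_are_anagram are hypothetical names, the decoder assumes well-formed UTF-8 and does no validation, and inputs are capped at 64 code points. String literals use hex escapes so the result does not depend on the source file encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CP 64 /* sketch limit, not taken from the exercise */

/* Hypothetical helper: decode one code point from well-formed UTF-8.
   Returns the number of bytes consumed; malformed input is not detected. */
static size_t utf8_next(const char *s, uint32_t *cp)
{
   const unsigned char *p = (const unsigned char *)s;
   if (p[0] < 0x80) { *cp = p[0]; return 1; }
   if ((p[0] & 0xE0) == 0xC0) {
      *cp = ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
      return 2;
   }
   if ((p[0] & 0xF0) == 0xE0) {
      *cp = ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
      return 3;
   }
   *cp = ((p[0] & 0x07u) << 18) | ((p[1] & 0x3Fu) << 12)
       | ((p[2] & 0x3Fu) << 6) | (p[3] & 0x3Fu);
   return 4;
}

static int cmp_byte(const void *a, const void *b)
{
   return (int)*(const unsigned char *)a - (int)*(const unsigned char *)b;
}

static int cmp_cp(const void *a, const void *b)
{
   uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
   return (x > y) - (x < y);
}

/* Byte-wise check: sorts and compares raw bytes, ignoring character boundaries. */
static int bytes_are_anagram(const char *a, const char *b)
{
   char x[MAX_CP], y[MAX_CP];
   size_t n = strlen(a);
   if (n != strlen(b) || n > MAX_CP) return 0;
   memcpy(x, a, n);
   memcpy(y, b, n);
   qsort(x, n, 1, cmp_byte);
   qsort(y, n, 1, cmp_byte);
   return memcmp(x, y, n) == 0;
}

/* Decode up to cap code points from s into out; returns how many were written. */
static size_t decode_all(const char *s, uint32_t *out, size_t cap)
{
   size_t n = 0;
   assert(s != NULL && out != NULL);
   while (*s != '\0' && n < cap) s += utf8_next(s, &out[n++]);
   return n;
}

/* Code-point check: sorts and compares decoded code points (case-sensitive). */
static int codepoints_are_anagram(const char *a, const char *b)
{
   uint32_t x[MAX_CP], y[MAX_CP];
   size_t n = decode_all(a, x, MAX_CP), m = decode_all(b, y, MAX_CP);
   if (n != m) return 0;
   qsort(x, n, sizeof x[0], cmp_cp);
   qsort(y, m, sizeof y[0], cmp_cp);
   return memcmp(x, y, n * sizeof x[0]) == 0;
}
```

For example, "Αω" is the byte sequence CE 91 CF 89 and "Ήϑ" is CE 89 CF 91. The two strings share no letters but contain the same bytes, so bytes_are_anagram("\xCE\x91\xCF\x89", "\xCE\x89\xCF\x91") reports a match while codepoints_are_anagram rejects the pair.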

I may not have the full context, so I'd like another maintainer to weigh in, but if we can write this example with the standard library without too much difficulty, I say we try to fix it. Otherwise I'm happy just removing it. I haven't written UTF-8-compatible C code before, so I'm not sure what facilities exist.

Answer 2 · 2021-07-04T18:04:59.000Z

@patricksjackson I agree. If this is doable without contortions using the standard library, then we should fix it; otherwise we should just remove it.

Answer 3 · 2021-07-06T19:55:24.000Z

The non-ASCII cases were removed from the specification in exercism/problem-specifications#414 due to exercism/problem-specifications#413.
Additionally, I think this is not possible with the standard library; e.g. tolower() only works on single char values, not on multi-byte characters.
To wit, I suggest we likewise remove the tests here.
Will prepare a PR for this now.
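
The tolower() limitation can be seen in a small sketch. It assumes the default "C" locale and UTF-8 encoded input; lower_bytes and equal_after_lower_bytes are hypothetical helpers, not code from the track, and the 64-byte buffers are an arbitrary limit for the sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: lowercase a string one byte at a time with tolower().
   Copies at most cap - 1 bytes and always terminates dst. */
static void lower_bytes(const char *src, char *dst, size_t cap)
{
   size_t i = 0;
   assert(src != NULL && dst != NULL && cap > 0);
   for (; src[i] != '\0' && i + 1 < cap; i++) {
      /* The cast to unsigned char avoids passing a negative value to tolower(). */
      dst[i] = (char)tolower((unsigned char)src[i]);
   }
   dst[i] = '\0';
}

/* Returns 1 if the two strings are equal after byte-wise lowercasing. */
static int equal_after_lower_bytes(const char *a, const char *b)
{
   char x[64], y[64];
   lower_bytes(a, x, sizeof x);
   lower_bytes(b, y, sizeof y);
   return strcmp(x, y) == 0;
}
```

In the "C" locale tolower() maps only 'A' to 'Z', so equal_after_lower_bytes("ABC", "abc") succeeds, while the bytes CE 91 CE 92 CE 93 of "ΑΒΓ" pass through unchanged and never match the bytes CE B1 CE B2 CE B3 of "αβγ".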

Thanks for reporting @tadas-s