twitter/twitter-cldr-rb

get_sort_key for special character sequence raises NoMethodError

Closed this issue · 7 comments

get_sort_key fails for a particular character sequence:

$ grep twitter_cldr Gemfile.lock 
    twitter_cldr (3.3.0)
collator = TwitterCldr::Collation::Collator.new(:en)
# => #<TwitterCldr::Collation::Collator:0x007f96f8a9b578
#  @locale=:en,
#  @options={},
#  @trie=#<TwitterCldr::Collation::TrieWithFallback:0x007f96f2b9c838>>
collator.get_sort_key("\u0450\u0D80")
# => NoMethodError: undefined method `combining_class' for nil:NilClass
# from ~/.rbenv/versions/2.2.4/lib/ruby/gems/2.2.0/gems/twitter_cldr-3.3.0/lib/twitter_cldr/collation/collator.rb:111:in `explicit_collation_elements'

Hey @phorsuedzie, thanks for the bug report. It looks like the character \u0D80 is an unassigned Unicode codepoint as of UCD v6.3.0, so TwitterCLDR has no information for it:

TwitterCldr::Shared::CodePoint.get("D80".to_i(16))  # => nil

Where is this character coming from in your code? It's possible a newer version of the UCD (Unicode Character Database) defines it, but we're stuck on v6.3.0 for the time being.
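Since CodePoint.get returns nil for codepoints the bundled UCD doesn't know about, a small helper can report every such codepoint in a string before you hand it to the collator. The helper name below is my own, not part of the TwitterCLDR API:

```ruby
# Hypothetical helper: list every codepoint in a string that TwitterCLDR's
# bundled UCD (v6.3.0) has no record of, i.e. where CodePoint.get is nil.
def unknown_code_points(str)
  str.each_codepoint.select do |cp|
    TwitterCldr::Shared::CodePoint.get(cp).nil?
  end
end

# unknown_code_points("\u0450\u0D80")
#   .map { |cp| format("U+%04X", cp) }  # => ["U+0D80"]
```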

The character sequence comes from an automatically generated text version of a Chinese PDF. But the text looks more like pure garbage: no Chinese characters, a lot of \f, and I wouldn't expect "\u0450" to appear there either.

That leaves some questions:

  • Is get_sort_key considered stable for a string consisting only of assigned codepoints? Or is the value likely to change when new codepoints are introduced?
  • For a string containing an unassigned character, is the behaviour of get_sort_key a "stable failure" (it fails for every unassigned character), or somewhat random (it fails for some characters and returns a value for others, and that value might change in the future)?
  • Is there any way to "sanitize" the input, e.g. to remove characters that get_sort_key cannot handle? I could probably do that myself by calling get_sort_key repeatedly with suspicious characters removed, but a way with fewer calls would be even better.

Ok, that's a pretty interesting use case. Where does all the garbage come from?

  • Yes, get_sort_key should be stable for all assigned/defined code points in UCD v6.3.0. TwitterCLDR uses the collation weights defined in the UCD and CLDR data sets, which get updated in lockstep.

  • I can't say for certain that get_sort_key will fail for every single unassigned/undefined character, because that behavior isn't defined by CLDR or Unicode (as far as I know). The NoMethodError you're seeing is probably a mistake. We should raise a more specific error to handle this case explicitly, which would make it a "stable failure."

  • You should be able to sanitize the input with a Unicode regexp, although I will admit I don't know if that will catch absolutely everything. The idea is to remove any character without a Unicode "general" property value assigned to it. If it has no general category, chances are the character isn't defined/assigned. Your mileage may vary:

    re = TwitterCldr::Shared::UnicodeRegex.compile("[^[:C:][:L:][:M:][:N:][:P:][:S:][:Z:]]")
    "\u0450\u0D80".gsub(re.to_regexp, '')  # => "\u0450"
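As an aside, recent Rubies expose the same general-category data through the regex engine's \p{} syntax, so a pure-Ruby equivalent is possible without compiling a TwitterCLDR UnicodeRegex. The caveat is that the Unicode version bundled with your Ruby may be newer than TwitterCLDR's UCD 6.3.0, so the two approaches can disagree about which codepoints count as assigned:

```ruby
# Sketch: strip unassigned codepoints (general category Cn) using Ruby's
# built-in Unicode property regex. Which codepoints match \p{Cn} depends
# on the Unicode version shipped with your Ruby.
garbage = "\u0450\u0D80"
clean   = garbage.gsub(/\p{Cn}/, '')  # => "\u0450" (U+0D80 removed)
```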

Thank you for your help.

Where does all the garbage come from?

Output of pdftotext on "foreign input data". There was only this single hiccup in the first 50 characters of all the PDFs we've processed (about 300k).

Huh, very weird. Sounds like a bug 🐞

Maybe it was even garbage to pdftotext (which it failed to refuse) and no "PDF bytes" at all - just some bytes copied from /dev/urandom into "chinese-garbage.pdf". You never know what gets uploaded :-)

As of v4.0.0, invalid codepoints cause the collator to raise a TwitterCldr::Collation::UnexpectedCodePointError.
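With a specific error class, callers can handle garbage input explicitly. A minimal sketch, assuming twitter_cldr >= 4.0.0 (the wrapper method and its fallback strategy are my own, not part of the gem):

```ruby
# Hypothetical wrapper: compute a sort key, and if the string contains a
# codepoint the collator rejects, retry with unassigned codepoints
# (general category Cn) stripped out instead of propagating the error.
def safe_sort_key(collator, str)
  collator.get_sort_key(str)
rescue TwitterCldr::Collation::UnexpectedCodePointError
  collator.get_sort_key(str.gsub(/\p{Cn}/, ''))
end
```

Whether silently dropping unassigned codepoints is acceptable depends on the application; for sorting mostly-garbage pdftotext output it is probably fine.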