get_sort_key for special character sequence raises NoMethodError
Closed this issue · 7 comments
`get_sort_key` fails for one character sequence:
```
$ grep twitter_cldr Gemfile.lock
    twitter_cldr (3.3.0)
```
```ruby
collator = TwitterCldr::Collation::Collator.new(:en)
# => #<TwitterCldr::Collation::Collator:0x007f96f8a9b578
#     @locale=:en,
#     @options={},
#     @trie=#<TwitterCldr::Collation::TrieWithFallback:0x007f96f2b9c838>>

collator.get_sort_key("\u0450\u0D80")
# => NoMethodError: undefined method `combining_class' for nil:NilClass
#    from ~/.rbenv/versions/2.2.4/lib/ruby/gems/2.2.0/gems/twitter_cldr-3.3.0/lib/twitter_cldr/collation/collator.rb:111:in `explicit_collation_elements'
```
Hey @phorsuedzie, thanks for the bug report. It looks like the character `\u0D80` is an unassigned Unicode code point as of UCD v6.3.0, so TwitterCLDR has no information for it:

```ruby
TwitterCldr::Shared::CodePoint.get("D80".to_i(16))  # => nil
```
Where is this character coming from in your code? It's possible a newer version of the UCD (Unicode Character Database) defines it, but we're stuck on v6.3.0 for the time being.
The character sequence is from an automatically generated text version of a Chinese PDF. But this text looks more like "pure garbage": no Chinese characters, a lot of `\f`, and I wouldn't expect `\u0450` to be included there either.
That leaves some questions:

- Is `get_sort_key` considered stable for a string whose code points are all assigned? Or will the value probably change when new code points are introduced?
- Is the behaviour of `get_sort_key` for a string containing an unassigned character a "stable failure" (it fails for every unassigned character), or somewhat random (it fails for some characters and returns a value for others, and that value might change in the future)?
- Is there any way to "sanitize" the input, e.g. remove characters that `get_sort_key` cannot handle? I could probably do that myself by calling `get_sort_key` repeatedly with suspicious characters removed, but a way with fewer calls would be even better.
Ok, that's a pretty interesting use case. Where does all the garbage come from?
- Yes, `get_sort_key` should be stable for all assigned/defined code points in UCD v6.3.0. TwitterCLDR uses the collation weights defined in the UCD and CLDR data sets, which get updated in lockstep.
- I can't say `get_sort_key` will fail for every single unassigned/undefined character, because that behavior isn't defined by CLDR or Unicode (that I know of). The fact that you're seeing this `NoMethodError` is probably a mistake. We should raise a more specific error to handle this case explicitly, which would make it a "stable failure."
- You should be able to sanitize the input with a Unicode regexp, although I'll admit I don't know if it will catch absolutely everything. The idea is to remove any character without a Unicode "general" property value assigned to it. If a character has no general category, chances are it isn't defined/assigned. Your mileage may vary:

```ruby
re = TwitterCldr::Shared::UnicodeRegex.compile("[^[:C:][:L:][:M:][:N:][:P:][:S:][:Z:]]")
"\u0450\u0D80".gsub(re.to_regexp, '')  # => "\u0450"
```
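If pulling in the gem's `UnicodeRegex` feels heavy, a similar filter can be sketched with Ruby's built-in Unicode property syntax. Note this is an approximation relying on the Ruby interpreter's own Unicode tables, which are generally newer than the UCD 6.3.0 data TwitterCLDR ships, so the two filters can disagree on recently assigned characters:

```ruby
# Sketch: drop code points that Ruby's regexp engine marks as unassigned
# (general category Cn). \u0D80 is unassigned; \u0450 (CYRILLIC SMALL
# LETTER IE WITH GRAVE) is assigned.
def strip_unassigned(str)
  str.gsub(/\p{Cn}/, '')
end

strip_unassigned("\u0450\u0D80")  # => "\u0450"
```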
Thank you for your help.
> Where does all the garbage come from?
Output of `pdftotext` on "foreign input data". There was only this single hiccup within the first 50 characters of all the PDFs we've processed (about 300k).
Huh, very weird. Sounds like a bug 🐞
Maybe it was even garbage to `pdftotext` (which failed to refuse it) and not "PDF bytes" at all: just some bytes copied from `/dev/urandom` into "chinese-garbage.pdf". You never know what gets uploaded :-)
As of v4.0.0, invalid code points cause the collator to raise a `TwitterCldr::Collation::UnexpectedCodePointError`.
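For anyone landing here on 4.x: the new error can be rescued like any ordinary exception. Below is a minimal sketch of the calling pattern; the stub classes stand in for the real collator so the snippet is self-contained, and the nil fallback is purely illustrative:

```ruby
# Local stand-in for TwitterCldr::Collation::UnexpectedCodePointError,
# so this sketch runs without the gem installed.
class UnexpectedCodePointError < StandardError; end

# Stub collator: rejects strings containing unassigned code points
# (per Ruby's Unicode tables), loosely mimicking the 4.x behavior.
def get_sort_key(str)
  raise UnexpectedCodePointError, str.inspect if str =~ /\p{Cn}/
  str.codepoints  # placeholder for real collation weights
end

def sort_key_or_nil(str)
  get_sort_key(str)
rescue UnexpectedCodePointError
  nil  # or strip the offending characters and retry
end

sort_key_or_nil("\u0450")        # => [1104]
sort_key_or_nil("\u0450\u0D80")  # => nil
```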