opengeospatial/ogcapi-features

CQL2: Additional guidance on CASEI / ACCENTI

jerstlouis opened this issue · 17 comments

Having spent some time digging through the Unicode resources to figure things out, I think it would be very useful to provide short, to-the-point informative guidance on what is meant, instead of just referencing the standard, for those who are implementing this manually. From my initial understanding:

CASEI

Essentially you want to use the file https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt (derived from field 13 of UnicodeData.txt for the simple lowercase mappings and from https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt for the multi-character mappings) and replace each character with the sequence given on the lines with a 'C'(ommon) or 'F'(ull) status.
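For illustration only (this is my own sketch, not spec text), here is roughly what that lookup could look like in Python, using a few sample lines copied from CaseFolding.txt; Python's built-in str.casefold(), which implements full case folding, is used as a cross-check:

# Minimal sketch of CASEI as Unicode full case folding (status 'C' and 'F'
# lines of CaseFolding.txt). A handful of sample lines are inlined here so
# the snippet is self-contained; a real implementation would read the file.
SAMPLE_CASEFOLDING = """
0041; C; 0061; # LATIN CAPITAL LETTER A
0053; C; 0073; # LATIN CAPITAL LETTER S
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0391; C; 03B1; # GREEK CAPITAL LETTER ALPHA
"""

def load_fold_map(text):
    fold = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(';')[:3]]
        if status in ('C', 'F'):   # ignore 'S'(imple-only) and 'T'(urkic) lines
            fold[chr(int(code, 16))] = ''.join(chr(int(c, 16)) for c in mapping.split())
    return fold

FOLD = load_fold_map(SAMPLE_CASEFOLDING)

def casei(s):
    return ''.join(FOLD.get(ch, ch) for ch in s)

assert casei('Straße') == 'Straße'.casefold() == 'strasse'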

ACCENTI

Essentially you need fields 3 and 5 from https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
and apply Normalization Form D (NFD: canonical decomposition), or NFKD if compatibility decompositions should also be folded, by:

  • Recursively replacing each code point with its Decomposition Mapping (field 5), decomposing precomposed Hangul syllables algorithmically:
if(codePoint >= 0xAC00 && codePoint < 0xD7A4)   // precomposed Hangul syllables (0xAC00 + 11172)
{
   unsigned int syllable = codePoint - 0xAC00;
   unsigned int t = syllable % 28, v, l;         // t: trailing consonant (jongseong) index
   syllable /= 28;
   v = syllable % 21, l = syllable / 21;         // v: vowel (jungseong), l: leading consonant (choseong)
   add(0x1100 + l);                              // add() appends a code point to the output
   add(0x1161 + v);
   if(t) add(0x11A7 + t);                        // trailing consonant only if present
}
  • Applying the Canonical Ordering Algorithm, i.e., stable-sorting (e.g., with a bubble sort) any sub-sequence of decomposed code points whose Canonical Combining Class (field 3 of UnicodeData.txt, also in https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedCombiningClass.txt) is non-zero, by the value of that combining class.
    (This step is only necessary if there were combining marks in the source text; the Decomposition Mappings themselves have been in canonical order since version 2.1.9. NOTE: canonical ordering only matters for code points that do not get stripped, so it has no effect on the combining characters that are nonspacing marks and are removed below.)

then:

  • Removing all Nonspacing Marks (category Mn; the general category is field 2 of UnicodeData.txt, also available in https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt). However, an exception should probably be made for some characters categorized as Mn: some, like the Japanese voicing marks (dakuten and handakuten), drastically change the consonant sound, which would be analogous to changing k -> g or h -> p in Latin alphabets (see note below). A short sketch of the full pipeline follows this list.
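As an informal sketch of the above pipeline (not normative text, and relying on Python's unicodedata module rather than parsing the UCD files directly), ACCENTI could look like this; note that unicodedata.normalize('NFD', ...) already performs both the decomposition and the canonical ordering, so the explicit canonical_order() step is shown only to illustrate the algorithm:

import unicodedata

def canonical_order(chars):
    # Canonical Ordering Algorithm: stable-sort each run of code points whose
    # Canonical Combining Class (field 3 of UnicodeData.txt) is non-zero,
    # by that class value. sorted() is stable, so equal classes keep their order.
    out, run = [], []
    for c in chars:
        if unicodedata.combining(c):          # non-zero combining class
            run.append(c)
        else:
            out += sorted(run, key=unicodedata.combining)
            run = []
            out.append(c)
    out += sorted(run, key=unicodedata.combining)
    return out

def accenti(s):
    decomposed = unicodedata.normalize('NFD', s)   # decomposition (incl. Hangul) + ordering
    ordered = canonical_order(decomposed)          # redundant after NFD; shown for clarity
    # Strip everything whose general category is Mn (Nonspacing_Mark)
    return ''.join(c for c in ordered if unicodedata.category(c) != 'Mn')

print(accenti('ñÉäßÑéàâ'))   # -> nEaßNeaa
print(accenti('한글'))        # Hangul decomposes to jamo, which are not Mn, so nothing is stripped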

@jerstlouis these files are HUGE ... you expect us to put these in the specification!?

@pvretano Of course not :)
Just a link to them as informative guidance.
The difficult part is figuring out, from the Unicode standards and documents, which files and fields you need and what to do with them.

@pvretano I think Requirement 9 is missing a clause about actually stripping the accents (nonspacing marks) after the normalization. Normalization by itself does not remove accents; it only ensures a particular way of encoding them.

See also #850 about whether additional types of character folding are necessary, such as diacritic folding (which would turn københavn into kobenhavn), or hiragana / katakana folding, which would allow searching Japanese text using either the hiragana or katakana forms.

Meeting 2023-07-31:

  • We need to add a recommendation to apply Unicode normalization before any string comparison (currently this is section D of Req 9). We will recommend using NFD for normalization.
  • In Req 9 we need to replace section D with support for accent stripping, diacritic folding and one of the Japanese foldings (likely Katakana folding).
  • In Req 9 we also need to be clear about the normalization. Since most implementations will map ACCENTI to their backend capabilities, we need to survey what databases support with simple configuration options. Typically an existing collation would be used. @pvretano will review what common database solutions support out-of-the-box.
  • The ATS needs to be updated / extended, too.

This also addresses #850.

We are working on an open-source implementation of CASEI and ACCENTI in our cross-platform SDK (written in eC, which transpiles to C, and has bindings for C, C++ and Python).
It is now mostly complete and available in the default branch (latest) of the ecere-sdk repo.
See https://github.com/ecere/ecere-sdk/blob/latest/ecere/src/sys/unicode.ec for the code and https://github.com/ecere/ecere-sdk/tree/latest/ecere/res/unicode for the embedded resources used in the code which are derived from the Unicode database.

Our latest finding is that the nonspacing mark Unicode category Mn includes the Japanese voicing and half-voicing marks that are added to hiragana and katakana to change, for example, "shi" into "ji" and "ha" into "pa", respectively.

These are the dakuten ( ゛) U+3099 and handakuten (゜) U+309A (though the characters shown here are the non-combining ones that fonts can display properly, rather than the combining ones that the hiragana/katakana letters with them decompose to; see also https://en.wikipedia.org/wiki/Dakuten_and_handakuten).

Stripping these nonspacing marks in this context is a rather lossy change that would turn はじめ "hajime" (beginning) into はしめ "hashime" (fastener). One could argue that other diacritic foldings are just as disruptive, but in this case the transcriptions using Latin letters (romaji) would be very different.

So I think there should at least be some flexibility in accenti() to allow some nonspacing marks (Mn) not to be stripped. Our implementation will avoid stripping U+3099 and U+309A in this case (these two specific marks could be part of the requirement or recommendation).

Although from Wikipedia:

The handakuten is an innovation by Portuguese Jesuits, who first used it in the Rakuyōshū. These Jesuits had the need to accurately transcribe Japanese sounds, which the Japanese tended to neglect by making no distinction between /h/, /b/ and /p/ in their own writing.

An interesting topic, to what extent this distinction is important.
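A small variant of the accenti() sketch given earlier could implement that exception by whitelisting the two combining marks (this reflects the behaviour of our implementation as described above, not anything the draft currently requires):

import unicodedata

# Combining marks to keep even though their general category is Mn
# (here, only the Japanese voicing marks discussed above).
PRESERVED_MARKS = {'\u3099', '\u309A'}

def accenti_keep_voicing(s):
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed
                   if unicodedata.category(c) != 'Mn' or c in PRESERVED_MARKS)

# はじめ "hajime" keeps its dakuten instead of collapsing to はしめ "hashime"
assert unicodedata.normalize('NFC', accenti_keep_voicing('はじめ')) == 'はじめ'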

I surveyed the common databases (Oracle, Postgres, MySQL, etc.) and while most have functions that will convert the case of strings, only Postgres seems to have a function that will "unaccent" a string. The other databases have functions like "CONVERT()" and "TRANSLATE()" which allow users to explicitly map one character to another ... so theoretically you could implement an ACCENTI() function in the other databases by using something like PL/SQL (for example) to capture all the necessary logic on the back end.

@pvretano In general, I believe that collations are used for this purpose. That's definitely the way for SQLite, and it also seems to be the case for MySQL per the top answer on Stack Overflow:

https://stackoverflow.com/questions/4813620/how-to-remove-accents-in-mysql

In fact they mention built-in accent-insensitive collations (_ai_ci) introduced in MySQL 8.0.

All the databases I surveyed have a function that will apply case folding to a string. However, only Postgres seems to have a function that specifically applies accent stripping and diacritic folding to a string.

Yes, collations are typically used to support case-insensitive (_ci) and accent-insensitive (_ai) searching and sorting, but the use of collations typically requires setting a global or session variable (as the MySQL example shows) OR passing the desired collation to a language-specific function (e.g. NLS_SORT()) that does not return a string, all of which makes implementing an ACCENTI() function in CQL2 a little awkward. Furthermore, implementing a dedicated ACCENTI() function in a CQL2 implementation, rather than translating it into some underlying database engine clause, is also not exactly trivial (as @jerstlouis showed).

Perhaps, instead of CASEI() and ACCENTI(), we can introduce a "USING COLLATION '....'" clause that can be appended to a predicate to specify the collation to use for string comparisons in the predicate. Something like:

cql_text="my_property='ñÉäßÑéàâ' and S_INTERSECTS(footprint,POLYGON((43.5845 -79.5442,43.6079 -79.4893,43.5677 -79.4632,43.6129 -79.3925,43.6223 -79.3238,43.6576 -79.3163,43.7945 -79.1178,43.8144 -79.1542,43.8555 -79.1714,43.7509 -79.6390,43.5845 -79.5442))) using collation 'binary_ci_ai'"

The problems with this approach are:

  • I'm not sure if the set of collation names is standardized.
  • I'm not sure that the convention of appending _CI and _AI to the collation name to signal case and accent insensitive behaviour is standardized.

Maybe we can get around this problem by saying that CQL2 implementations must support the UNICODE (UTF-8 or UTF-16) character set for strings with the BINARY collation, and that the qualifiers _CI and/or _AI can be appended to that name to indicate case- and/or accent-insensitive string comparisons. Other character sets can be used too, but they are not specified in the standard.

Just thinking out loud here ... Thoughts?

setting a global or session variable (as the MySQL example shows) OR passing the desired collation to a language-specific function

I'm not sure what that involves exactly, but if it can be done in a way that affects only the current CQL2 query, then there should be no issue?

we can introduce a "USING COLLATION '....'" clause that can be appended to a predicate to specify the collation to use for string comparisons in the predicate.

Wouldn't it be possible to automatically translate the ACCENTI() / CASEI() calls to either USING COLLATION, using whichever collation name is appropriate, OR to the COLLATE approach, whichever is easier on a given RDBMS? That would avoid the problems of non-standardized names, and would avoid having to change the spec.
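To make the idea concrete (purely hypothetical code, with placeholder collation names that an implementation would have to map to whatever its backend offers, e.g. MySQL 8.0's _ai_ci family), such a translation might look like:

# Hypothetical sketch: given an already-parsed CQL2 comparison in which the
# parser recorded whether operands were wrapped in CASEI()/ACCENTI(), emit SQL
# that uses a backend collation instead of per-operand function calls.
def comparison_to_sql(column, literal, casei=False, accenti=False):
    collations = {                     # placeholder names, backend-specific in practice
        (True,  False): 'example_ci',
        (False, True):  'example_ai',
        (True,  True):  'example_ai_ci',
    }
    sql = f"{column} = '{literal}'"    # real code must escape or bind the literal
    if casei or accenti:
        sql += f" COLLATE {collations[(casei, accenti)]}"
    return sql

# CASEI(my_property) = CASEI('ñÉäßÑéàâ')  ->  my_property = 'ñÉäßÑéàâ' COLLATE example_ci
print(comparison_to_sql('my_property', 'ñÉäßÑéàâ', casei=True))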

Although I understand that the intent of CQL2 is to easily support different database backends, I feel that it is unrealistic to expect to just throw the CQL2 query at any RDBMS and have it work as-is, without first parsing it and adapting it to the backend RDBMS.

implementing a dedicated ACCENTI() function in a CQL2 implementation, rather than translating it into some underlying database engine clause, is also not exactly trivial (as @jerstlouis showed).

It's really not that bad, and it's a lot easier once you have a clear idea of what to do (which is what this issue aims to facilitate) :)
See our open-source (BSD-3) implementation of casei() and accenti(), including Unicode normalization (NFC/NFD/NFKC/NFKD) and Hangul (de)composition. It should be relatively easy to follow by example, to port the code to any C-derived language, or to directly invoke our casei() / accenti() implementation from the automatically generated C, C++ or Python bindings for libecere (happy to help anyone set that up).

Implementing DE-9IM from scratch is a lot more difficult (and fun)! :P

@jerstlouis our implementation of CQL2 does exactly that ... i.e. translate the CQL2 expression to the SQL of an underlying database and make all necessary translations to work. I guess I was not sure if that was too high an implementation bar, so I proposed the USING COLLATION clause which seems closer to what underlying databases do.

Anyway, sounds like what we have is OK. So, I think we should just go with that.

make all necessary translations to work.

right, I mean I would expect such translations to be needed.

USING COLLATION clause which seems closer to what underlying databases do.

I'm wondering what that would look like if you have multiple comparisons but only want to use AI_CI for some of them?

Anyway, sounds like what we have is OK. So, I think we should just go with that.

+1 for not changing the status quo.

I believe that from a CQL2 parser perspective, a USING clause would be a major nightmare for most implementers -- certainly for us.
Our top-level AST construct is an "expression", but USING is an SQL thing that belongs outside of the whole expression clause (for the same reason we don't have SELECT, FROM or ORDER BY keywords in CQL2).

In this issue I was mainly hoping to provide more guidance about what a from-scratch implementation of accenti() / casei() (or an actual collation) is supposed to do: documenting the types of Unicode normalization, which one(s) to use where, the steps involved in normalization and the subsequent accent stripping and diacritic folding, and references to the relevant Unicode files, so as to avoid implementers having to spend days researching what they need to do from Unicode resources spread all over the place (and not always obvious).

Documenting the use of existing AI/CI collations for particular RDBMSs would also be very useful for those taking that path.

I agree that we should keep the current approach at this point. It hides the complexity from the user. If it turns out that ACCENTI() is too hard to implement, we can always consider an alternative approach (and deprecate ACCENTI()) in a future version as it is a separate requirements class.

I was reading parts of the Unicode spec yesterday. The topic looks really complex for a newcomer (like CRSs for those not familiar with CRSs). Unless the backends make it easy for CQL2 implementations to support ACCENTI() I think it will be implemented only by those that really need it.

We currently implement support for ACCENTI() only for SQL backends and as a fixed configuration option to specify the collation that will be applied to the value.

@jerstlouis, if I understand it correctly, you suggest that PR #857 is extended with SQL guidance? Do you think that what the PR says about Unicode is sufficient? We should avoid making CQL2 a Unicode tutorial.

@cportele

if I understand it correctly, you suggest that PR #857 is extended with SQL guidance?

Correct, based on what @pvretano mentioned about TRANSLATE() and CONVERT() or the COLLATE keyword.

Looking more into this, I think COLLATE is actually typically used at the query level, e.g., https://dev.mysql.com/doc/refman/8.0/en/charset-collate.html .
I don't know if the USING COLLATION that @pvretano mentioned is something new, rather than something implemented by any RDBMS? MS SQL Server, MySQL, Postgres, Oracle and SQLite all support the COLLATE keyword at the query level (as a unary postfix operator).

I would be much less against something that qualifies a comparison expression than something that goes at the very end and does not really fit the expression model. We could potentially simply define 3 built-in "CQL2" collations that implementers can map to whichever concept they have: COLLATE [cql2_ci | cql2_ai_ci | cql2_ai].

This would also avoid the annoyance of having to repeat casei() and accenti() on both sides all the time.
It could be added as an optional addition to the binaryComparisonPredicate BNF rule.

In CQL2-JSON, it could be a new collate property for binaryComparisonPredicate with an enum of ci, ai_ci, ai, or alternatively two boolean properties caseInsensitive and accentInsensitive defaulting to false.

The above example query would be written instead:

filter=my_property='ñÉäßÑéàâ' COLLATE cql2_ai_ci
   and S_INTERSECTS(footprint,
      POLYGON((43.5845 -79.5442,43.6079 -79.4893,43.5677 -79.4632,43.6129 -79.3925,43.6223 -79.3238,
         43.6576 -79.3163,43.7945 -79.1178,43.8144 -79.1542,43.8555 -79.1714,43.7509 -79.6390,
         43.5845 -79.5442)))

which I think is closer to how it's actually done in RDBMSs.
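In CQL2-JSON, the first comparison of that filter with the hypothetical collate property suggested above might be encoded along these lines (illustrative only, not part of the current schema):

{
  "op": "=",
  "args": [ { "property": "my_property" }, "ñÉäßÑéàâ" ],
  "collate": "ai_ci"
}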

Perhaps something to reconsider.

Do you think that what the PR says about Unicode is sufficient? We should avoid making CQL2 a Unicode tutorial.

I think it is sufficient in terms of quantity of content, at first glance it seems to be following my recommendations of what to add above.
However, I did not get the chance to review the PR yet, and I would like to do so (though we can merge it first and I can submit a new PR if I see anything that should be fixed / improved).

I think it's a very to-the-point "short/quick implementation guide" that tells you most of what you need to know without getting lost in lengthy explanations, and it would be very useful for anyone actually implementing this from scratch like we did (it would probably have saved us weeks of work).

Meeting 2023-01-15: The proposal from @jerstlouis has the advantage of not having to use CASEI or ACCENTI on both sides of the operator, but it is quite late for such a change and, e.g., STAC API may already use CASEI or ACCENTI. @pvretano will check.

Providing SQL guidance seems like too much work, also because all implementations are different. Maybe we can add a sentence that in a SQL database this will typically be implemented using a collation.

Code Sprint 2024-02-26: @pvretano has updated the text. We prefer to keep CASEI and ACCENTI at this stage. We can always deprecate them and introduce a new capability in a version 1.x.

@cportele and @jerstlouis should do a final check and we then merge.