cockroachdb/cockroach

sql: Support `CITEXT` type alias for case-insensitive text

bdarnell opened this issue · 21 comments

We have limited support for case-insensitive column types via a special and undocumented collation. Postgres has a CITEXT type for this purpose. Supporting CITEXT as an alias for TEXT COLLATE en_u_ks_level2 would make this much more discoverable and usable.

gz#11468

Jira issue: CRDB-5855

Good idea. Note we'll have to make them interact with collations properly as well, which seems slightly confusing:

As of PostgreSQL 9.1, you can attach a COLLATE specification to citext columns or data values. Currently, citext operators will honor a non-default COLLATE specification while comparing case-folded strings, but the initial folding to lower case is always done according to the database's LC_CTYPE setting (that is, as though COLLATE "default" were given). This may be changed in a future release so that both steps follow the input COLLATE specification.

Yeah. That doesn't seem too bad, though. Just attach _u_ks_level2 to whatever base collation is specified (and disallow any non-basic collations)

This might be a good starter project.

I need to start work on this issue , please any one give me hints how to get started.

@knz @jordanlewis
i need your help how to get started work on this issue.

shaqq commented

👋 @bdarnell, I was thinking about tackling this issue... but I'm also new to the codebase.

I'm assuming you want to avoid a hacky approach, like looking for the string CITEXT in the original sql statement and replacing it with TEXT COLLATE en_u_ks_level2 (that also wouldn't allow you to add a COLLATE statement to an existing CITEXT column)

From reading the tech notes, it seems like I'd have to modify the parser/sql.y to recognize the CITEXT token and... perhaps call it out as something special. And then in sem/tree/normalize.go look for the CITEXT token in the AST and swap it out by adding _u_ks_level2 to the base collation?

I don't feel like I quite have it right, but I wanted to ask before getting into a rabbit hole

knz commented

Ben is on parental leave and I can help instead. The technical approach you have identified is sound and seems like a good direction to explore.

shaqq commented

So I've looked at implementing this feature for a bit, and had a few questions on intended functionality.

If we naively alias CITEXT as en_u_ks_level2, we can have functionality like so:

(this is an example logictest)

statement ok
CREATE TABLE t (
  name CITEXT PRIMARY KEY,
  num  INT
)

statement ok
INSERT INTO t VALUES
  ('larry' COLLATE en_u_ks_level2, 1),
  ('Tom' COLLATE en_u_ks_level2, 2),
  ('Damian' COLLATE en_u_ks_level2, 3),
  ('NEAL' COLLATE en_u_ks_level2, 4),
  ('Bjørn' COLLATE en_u_ks_level2, 5)

query I
SELECT num FROM t WHERE name = 'neal' COLLATE en_u_ks_level2
----
4

This is something I have implemented currently, and I can get that logic test passing. However, it's not quite how it works in Postgresql, which would be like so:

statement ok
CREATE TABLE t (
  name CITEXT PRIMARY KEY,
  num  INT
)

statement ok
INSERT INTO t VALUES
  ('larry', 1),
  ('Tom', 2),
  ('Damian', 3),
  ('NEAL', 4),
  ('Bjørn', 5)

query I
SELECT num FROM t WHERE name = 'neal'
----
4

This seems a bit more tricky to implement. CITEXT is not a primary data type in Postgres - it actually appears as a USER-DEFINED type, which CRDB does not have currently. I can look into making this a special type specific to CRDB, though I believe it would cause a divergence with the PG wire protocol since there's no object id associated with CITEXT (please correct me if I'm wrong on this).

If anyone has any thoughts on which way to go, please let me know!

knz commented

Hi @shaqq; first of all, many thanks for your preliminary investigation and clarification of the problem.

First of all a clarification: how did you implement the aliasing? It sounds like you made CITEXT an alias for STRING COLLATE en_u_ks_level2.

With your current implementation a CITEXT column would get the same OID as STRING/TEXT, because that's the OID allocated by CockroachDB to all collated strings (oid.T_text).

Now you are suggesting that CITEXT in PostgreSQL has no object ID. This seems surprising - there is always some ID reported for any data type.

What happens if you do SELECT citext_column FROM ... on a pg server? What OID does the server report for that column?

It may be that pg allocates something else than the oid for T_text; perhaps it's T_any? or some other value corresponding to the entry for the type in pg_type?

I agree that the rest of the investigation here needs to create certainty on this topic. The OID of the column's type, if not T_text, must be clarified and reproduced in the implementation for CITEXT.

It's also important to check whether the OID is stable across postgres installations. The citext module is not pre-built (I assume it can be dynamically loaded), which might mean the OID for its type is not static. If it is not, then that creates ground for CockroachDB to pick some arbitrary value for it.

Finally, I would like to understand what's up with the last query. How come your implementation supports WHERE name = 'neal' COLLATE en_u_ks_level2 but not WHERE name = 'neal'?

shaqq commented

Right exactly - PG returns returns an OID for CITEXT, but it's presumably the behavior for any USER-DEFINED type.

For example, on my PG instance (9.6.16), it has an OID of 51743 in my pg_type catalog, but that OID does not exist in pq/oid/types.go. It doesn't exist in the 9.6 SQL for citext, nor in the underlying implementation. I haven't seen anything in the implementations of the recent versions of PG either.

That suggests that what you're saying here is true:

The citext module is not pre-built (I assume it can be dynamically loaded), which might mean the OID for its type is not static. If it is not, then that creates ground for CockroachDB to pick some arbitrary value for it.

I can look into creating a static, custom OID for a CITEXT type for CRDB. I don't think that requires a modification to the pq library, as we should be able to create something custom for CRDB only. I'm not entirely sure what we want for the SHOW COLUMNS FROM TABLENAME behavior to be though - should we return CITEXT or USER-DEFINED (like PG does)?

As for my previous implementation, it was pretty straightforward - I added syntax in sql.y to pickup CITEXT as a data type, and then modified sql/types/types.go to add a Citext type of family CollatedStringFamily with a hardcoded locale. The collatedstring logic test describes a lot of the behavior for collated strings in CRDB.

However, as we're discovering, this sort of naive aliasing isn't the right way to go. I can look into creating a custom type just for CRDB with a custom OID. Maybe I get something working in a branch and then we go from there?

knz commented

We have a way in the code to assign a different (specific) OID to a type alias. See for example the difference between the objects String and Varchar in pkg/sql/types.

When it comes to allocating a fixed OID, I am generally favorable as long as there's a range of values known to be "reserved for extensions" in postgresql. Also, if/when you add a specific OID for CITEXT, this will need to be reported in the various pg_catalog tables, see sql/pg_catalog.go.

In any case this discussion made me realize this is not exactly "easy" or a "good first issue". We did not know this before you started investigating, but our apologies for that nonetheless. We'll appreciate if you want to continue investigate further, but we'd also understand if you felt overwhelmed and wanted to call it a day.

shaqq commented

Ha no worries! It's a good way for me to get acquainted with the CRDB codebase. Let me take another whack at it and see what I come up with. After that, if it's really far from where we want it to be and someone else wants to take over, I can move onto something else.

Thanks for the tips!

shaqq commented

Hm, I've spun my wheels on this for a bit and not sure where to go. I can't quite find where in the codebase we do comparisons. For instance, for the following query:

statement ok
CREATE TABLE t (
  name TEXT PRIMARY KEY,
  num  INT
)

statement ok
INSERT INTO t VALUES
  ('foo', 1)

query I
SELECT num FROM t WHERE name = 'foo'
----
1

Where does the comparison of whereClauseValue == columnValue, i.e. 'foo' == 'foo' happen in the codebase? My first hunch was in datum.go to look in the (d *DString) Compare function, but I'm not finding that it actually does that comparison from the where clause (it's used elsewhere, for sure).

In any case, I'm thinking about moving onto something else, so if someone else wants to take on this feature, feel free!

Hey @shaqq, the problem you're running into is that, in your query, we don't actually have to perform comparison because the WHERE clause is pushed down into the index.

Try your query again without the PRIMARY KEY definition - that should hit the Compare function.

This would be a very useful feature as long as the collation derivation would be implicit like in Postgres and not explicit as it is currently.

The collation derivation of an expression can be implicit or explicit. This distinction affects how collations are combined when multiple different collations appear in an expression. An explicit collation derivation occurs when a COLLATE clause is used; all other collation derivations are implicit. When multiple collations need to be combined, for example in a function call, the following rules are used:

  1. If any input expression has an explicit collation derivation, then all explicitly derived collations among the input expressions must be the same, otherwise an error is raised. If any explicitly derived collation is present, that is the result of the collation combination.

  2. Otherwise, all input expressions must have the same implicit collation derivation or the default collation. If any non-default collation is present, that is the result of the collation combination. Otherwise, the result is the default collation.

  3. If there are conflicting non-default implicit collations among the input expressions, then the combination is deemed to have indeterminate collation. This is not an error condition unless the particular function being invoked requires knowledge of the collation it should apply. If it does, an error will be raised at run-time.

https://www.postgresql.org/docs/13/collation.html

Nican commented

I just want to add a note here that "just adding an alias to a different COLLATE" does not seem to fix all the issues.

I am converting a codebase from MySQL to CockroachDB, and when I tried to use en-US-u-ks-level2 on an username field, and run WHERE "User"."username" = 'Nican', I got the following the error: error: unsupported comparison operator: <collatedstring{en-US-u-ks-level2}> = <string>.

Are there any other easy fixes for case-insensitive indexes?

This is also needed for Elixir/phoenix/ecto/postgrex. Eg

mix phx.gen.auth Accounts User users; mix deps.get; mix ecto.migrate

This is also needed for Elixir/phoenix/ecto/postgrex. Eg

mix phx.gen.auth Accounts User users; mix deps.get; mix ecto.migrate

While this is true, it is generating a migration for Postgres (I'm not sure if it creates a different migration depending on the ecto adapter provided). You are free to edit the migration before running it to match your used database.

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

I would still very much like CITEXT.

I have been using indexes with tolower, but I feel like it adds complexity.