Impossible to set correct collation
vsalvino opened this issue · 0 comments
Describe the bug
Use case: the database is configured with a character set, and a non-default collation for that character set.
[mysqld]
character_set_server=utf8mb4
collation_server=utf8mb4_unicode_ciFor reference, note that on this particular installation (MariaDB 10.3), the default collation for utf8mb4 is utf8mb4_general_ci, NOT utf8mb4_unicode_ci.
The bug in this case is that when using mysqlclient, it is IMPOSSIBLE to set the correct collation. There is an option charset, which whether it is provided, blank, or empty, issues a command as such:
SET NAME <charset>SET NAME will ALWAYS use the default collation for the charset if the collation is not specified, which is documented here: https://mariadb.com/kb/en/set-names/, and much more deeply documented here: https://dev.mysql.com/doc/refman/8.0/en/charset-connection.html.
In this case we are specifying charset="utf8mb4". The command is being generated is:
SET NAMES utf8mb4When COLLATION is omitted, this uses the default collation for utf8mb4 which utf8mb4_general_ci. This is hard-coded into MariaDB and cannot be changed: https://mariadb.com/kb/en/change-is_default-value-for-information_schemacollations/
The problem is that certain SQL queries, such as a CAST as generated by the Django Cast function, will infer the connection's collation, and can fail if the connection's collation does not match the table:
(1267, "Illegal mix of collations (utf8mb4_unicode_ci,IMPLICIT) and (utf8mb4_general_ci,IMPLICIT) for operation '='")
The simple documented solution is to also specify the collation:
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ciI have tried several workarounds to set the collation, which currently fail, because set_character_set is always run AFTER the workaround, therefore negating it.
Workaround 1
Use the init_command option to send SET NAME utf8mb4 COLLATE utf8mb4_unicode_ci after connecting. Unfortunately, from examining the query log, charset is set AFTER init_command, therefore negating it. The query log looks as such:
31 Query SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci -- run by init_command
31 Query SET NAMES utf8mb4 -- from mysqlclientWorkaround 2
Use the init_connect option on the MySQL server. Unfortunately, the story is the same, as the charset is issued after the init_connect script is run, therefore negating it.
32 Query SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci -- run by the server
32 Query SET NAMES utf8mb4 -- from mysqlclientPossible fixes
Fix 1: Do not set charset if empty.
If charset is empty, or possibly None, do not set it. This way, it does not override the server's default behavior or the init_command
mysqlclient/MySQLdb/connections.py
Lines 195 to 197 in 5c04abf
It seems this behavior (which is the source of my bug) was introduce in 2.1: #509
Fix 2: Add a collation option.
This deviates from the MySQL C connector behavior, but it would be really nice to be able to specify both "charset" and "collation". Special implementation in Python might be required for this, e.g. to issue a SQL command after connecting.
Environment
MySQL Server
- MariaDB (Debian 11, Windows 10)
- MariaDB 10.3.35
MySQL Client
-
Debian, Windows
-
Python tested on multiple versions: 3.7+