dkpro/dkpro-uby

Uby MySQL dump does not handle UTF-8 properly


When querying a UBY MySQL dump for German lemmas with umlauts, the umlauts are not handled correctly:
e.g., querying for "sägen" returns all entries for both "sägen" and "sagen".

This can be reproduced in the Uby web browser, on the command line with plain MySQL, and via the Uby-API.

This issue does not occur with the H2 database (another point in favor of using H2).

That may depend on how the encoding of the MySQL database has been configured and whether that configuration is included in the MySQL dump. A properly configured MySQL server should support UTF-8.

We always used this configuration for the creation of the database (before starting the actual import):

CREATE SCHEMA DATABASE_NAME DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci ;

Hm. Do the tables that were then created by Hibernate also reflect the encoding and collation? In principle, you can set these parameters individually for each table.
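
For illustration (a sketch only; the table and column names are made up, not taken from the Uby schema), the charset and collation can be pinned per table at creation time, or retrofitted later:

-- Hypothetical table with charset/collation set explicitly:
CREATE TABLE `example_lemma` (
  `lemma` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;

-- Retrofit an existing table; this also converts the stored data:
ALTER TABLE `example_lemma` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;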

The tables are created via Hibernate in LMFDBUtils:
https://github.com/dkpro/dkpro-uby/blob/master/de.tudarmstadt.ukp.uby.persistence.transform-asl/src/main/java/de/tudarmstadt/ukp/lmf/transform/LMFDBUtils.java

  • This is not done for each table individually.
  • Apparently H2 has no problems with this way of creating tables, while MySQL does (see the check after this list).
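
To see what Hibernate actually produced, the per-table collation can be inspected directly (assuming the database is named uby_open_0_7_0, as in the dump below):

-- Lists every table together with its effective collation:
SHOW TABLE STATUS FROM `uby_open_0_7_0`;

-- Or show a single table's full definition, including DEFAULT CHARSET:
SHOW CREATE TABLE `ArgumentRelation`;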

Does the MySQL dump file include lines with reference to UTF-8? E.g.

/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `my-table` (
...
) ENGINE=InnoDB AUTO_INCREMENT=12001 DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;

yes:

C:\Users\Judith>more uby_open_0_7_0_nonfree.sql

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;

/*!40000 DROP DATABASE IF EXISTS `uby_open_0_7_0`*/;

CREATE DATABASE /*!32312 IF NOT EXISTS*/ `uby_open_0_7_0` /*!40100 DEFAULT CHARACTER SET utf8 */;

USE `uby_open_0_7_0`;
DROP TABLE IF EXISTS `ArgumentRelation`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `ArgumentRelation` (
  `argumentRelationId` bigint(20) NOT NULL,
  `relType` varchar(255) DEFAULT NULL,
  `relName` varchar(255) DEFAULT NULL,
  `target` varchar(255) DEFAULT NULL,
  `semanticArgumentId` varchar(255) DEFAULT NULL,
  `idx` int(11) DEFAULT NULL,
  PRIMARY KEY (`argumentRelationId`),
  KEY `argumentrelation_target_IDX` (`target`),
  KEY `FK3A19F3F9D07C6108` (`semanticArgumentId`),
  CONSTRAINT `FK3A19F3F9D07C6108` FOREIGN KEY (`semanticArgumentId`) REFERENCES `SemanticArgument` (`semanticArgumentId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;

LOCK TABLES `ArgumentRelation` WRITE;
/*!40000 ALTER TABLE `ArgumentRelation` DISABLE KEYS */;
.....

I think we had that problem as well with WebAnno until we set the default server encoding (and updated the documentation accordingly):

Make sure your MySQL server is configured for UTF-8. Check that the following line is present in /etc/mysql/my.cnf:

character-set-server = utf8

Source: https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/WebAnno%20(GitHub)%20(master)/de.tudarmstadt.ukp.clarin.webanno$webanno-doc/doclinks/2/#_prepare_database

It seems this can alternatively be specified on the command line when importing. Cf: https://makandracards.com/makandra/595-dumping-and-importing-from-to-mysql-in-an-utf-8-safe-way

But I believe that if you only specify it on the command line instead of in my.cnf, then you'll also have to set the encoding on the JDBC connection string. Cf: http://stackoverflow.com/questions/13234433/utf8-garbled-when-importing-into-mysql
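
Whichever way it is configured, the effective settings can be checked on any open connection; the character_set_* and collation_* variables should consistently report utf8:

-- Shows character_set_client, character_set_connection, character_set_server, ...:
SHOW VARIABLES LIKE 'character_set_%';
SHOW VARIABLES LIKE 'collation_%';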

@judithek Unfortunately, that is the expected behavior for the utf8_general_ci collation in MySQL. If you want to differentiate between "sägen" and "sagen" in your query, the collation must be utf8_bin.

Run the following and you'll see it (the demo uses utf8_unicode_ci, which behaves like utf8_general_ci for this comparison):

create table test_unicodeci (a varchar(20)) character set utf8 collate utf8_unicode_ci;
create table test_bin (a varchar(20)) character set utf8 collate utf8_bin;

insert into test_unicodeci values ("sagen");
insert into test_unicodeci values ("sägen");

insert into test_bin values ("sagen");
insert into test_bin values ("sägen");

select * from test_unicodeci where a = "sägen";
+--------+
| a      |
+--------+
| sagen  |
| sägen  |
+--------+

select * from test_bin where a = "sägen";
+--------+
| a      |
+--------+
| sägen  |
+--------+

Note, however, that if you use the utf8_bin collation, only case-sensitive searches are possible... :(

select * from test_unicodeci where a = "Sägen";
+--------+
| a      |
+--------+
| sagen  |
| sägen  |
+--------+

select * from test_bin where a = "Sägen";
Empty set (0.00 sec)
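
A possible middle ground (just a sketch against the test tables above): keep the insensitive collation on the column and override it per query with an explicit COLLATE clause, in either direction:

-- Accent-sensitive match on the ci table; should return only "sägen":
select * from test_unicodeci where a = "sägen" collate utf8_bin;

-- Case- and accent-insensitive match on the bin table; should return "sagen" and "sägen":
select * from test_bin where a collate utf8_general_ci = "Sägen";

Note that such per-query overrides typically prevent MySQL from using an index on the column.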

@reckart @betoboullosa thanks a lot for your comments and insights!

@betoboullosa IMO, case-sensitive searches make perfect sense for a lexical resource.

@judithek Yes, case sensitivity is OK for a lexical resource. It might be a problem when searching for words in sentences, for example.
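
If a case-insensitive but still accent-sensitive lookup is ever needed on a utf8_bin column, comparing lowercased values is one workaround (again only a sketch, with the caveat that the function call defeats index use):

-- Case-insensitive, accent-sensitive match on the bin table; should return only "sägen":
select * from test_bin where lower(a) = lower("Sägen");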