dkpro/dkpro-uby

Uby MySQL dump does not handle UTF-8 properly


When querying a UBY MySQL dump for German lemmas with umlauts, the umlauts are not handled correctly:
e.g., querying for "sägen" returns all entries for both "sägen" and "sagen".

This can be reproduced in the Uby web browser, on the command line with plain MySQL, and via the Uby-API.

This issue does not occur with the H2 database (another point in favor of using H2).

That may depend on how the encoding of the MySQL database has been configured and whether that configuration is included in the MySQL dump. A properly configured MySQL server should support UTF-8.

We always used this configuration for the creation of the database (before starting the actual import):

CREATE SCHEMA DATABASE_NAME DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci ;

Hm. Do the tables that were then created by Hibernate also reflect the encoding and collation? In principle, you can set these parameters individually for each table.
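
For illustration (a sketch only; the table and column names are made up, not taken from the Uby schema), the charset and collation can be pinned per table at creation time, or retrofitted later:

-- Hypothetical table with charset/collation set explicitly:
CREATE TABLE `example_lemma` (
  `lemma` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;

-- Retrofit an existing table; this also converts the stored data:
ALTER TABLE `example_lemma` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;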

The tables are created via Hibernate in LMFDBUtils:
https://github.com/dkpro/dkpro-uby/blob/master/de.tudarmstadt.ukp.uby.persistence.transform-asl/src/main/java/de/tudarmstadt/ukp/lmf/transform/LMFDBUtils.java

  • This is not done for each table individually.
  • Apparently H2 has no problems with this way of creating tables, while MySQL does (see the check after this list).
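
To see what Hibernate actually produced, the per-table collation can be inspected directly (assuming the database is named uby_open_0_7_0, as in the dump below):

-- Lists every table together with its effective collation:
SHOW TABLE STATUS FROM `uby_open_0_7_0`;

-- Or show a single table's full definition, including DEFAULT CHARSET:
SHOW CREATE TABLE `ArgumentRelation`;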

Does the MySQL dump file include lines with reference to UTF-8? E.g.

/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `my-table` (
...
) ENGINE=InnoDB AUTO_INCREMENT=12001 DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;

yes:

C:\Users\Judith>more uby_open_0_7_0_nonfree.sql

/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;

/*!40000 DROP DATABASE IF EXISTS `uby_open_0_7_0`*/;

CREATE DATABASE /*!32312 IF NOT EXISTS*/ `uby_open_0_7_0` /*!40100 DEFAULT CHARACTER SET utf8 */;

USE `uby_open_0_7_0`;
DROP TABLE IF EXISTS `ArgumentRelation`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `ArgumentRelation` (
  `argumentRelationId` bigint(20) NOT NULL,
  `relType` varchar(255) DEFAULT NULL,
  `relName` varchar(255) DEFAULT NULL,
  `target` varchar(255) DEFAULT NULL,
  `semanticArgumentId` varchar(255) DEFAULT NULL,
  `idx` int(11) DEFAULT NULL,
  PRIMARY KEY (`argumentRelationId`),
  KEY `argumentrelation_target_IDX` (`target`),
  KEY `FK3A19F3F9D07C6108` (`semanticArgumentId`),
  CONSTRAINT `FK3A19F3F9D07C6108` FOREIGN KEY (`semanticArgumentId`) REFERENCES `SemanticArgument` (`semanticArgumentId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;

LOCK TABLES `ArgumentRelation` WRITE;
/*!40000 ALTER TABLE `ArgumentRelation` DISABLE KEYS */;
.....

I think we had that problem as well with WebAnno until we set the default server encoding (and updated the documentation accordingly):

Make sure your MySQL server is configured for UTF-8. Check that the following line is present in /etc/mysql/my.cnf:

character-set-server = utf8

Source: https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/WebAnno%20(GitHub)%20(master)/de.tudarmstadt.ukp.clarin.webanno$webanno-doc/doclinks/2/#_prepare_database

It seems this can alternatively be specified on the command line when importing. Cf: https://makandracards.com/makandra/595-dumping-and-importing-from-to-mysql-in-an-utf-8-safe-way

But I believe that if you only specify it on the command line instead of in my.cnf, then you'll also have to set the encoding on the JDBC connection string. Cf: http://stackoverflow.com/questions/13234433/utf8-garbled-when-importing-into-mysql
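
Whichever way it is configured, the effective settings can be checked on any open connection; the character_set_* and collation_* variables should consistently report utf8:

-- Shows character_set_client, character_set_connection, character_set_server, ...:
SHOW VARIABLES LIKE 'character_set_%';
SHOW VARIABLES LIKE 'collation_%';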

@judithek Unfortunately, that is the expected behavior for the utf8_general_ci collation in MySQL. If you want to differentiate between "sägen" and "sagen" in your query, the collation must be utf8_bin.

Run the following and you'll see it (the demo uses utf8_unicode_ci, which behaves like utf8_general_ci for this comparison):

create table test_unicodeci (a varchar(20)) character set utf8 collate utf8_unicode_ci;
create table test_bin (a varchar(20)) character set utf8 collate utf8_bin;

insert into test_unicodeci values ("sagen");
insert into test_unicodeci values ("sägen");

insert into test_bin values ("sagen");
insert into test_bin values ("sägen");

select * from test_unicodeci where a = "sägen";
+--------+
| a      |
+--------+
| sagen  |
| sägen  |
+--------+

select * from test_bin where a = "sägen";
+--------+
| a      |
+--------+
| sägen  |
+--------+

Note, however, that if you use the utf8_bin collation, only case-sensitive searches are possible... :(

select * from test_unicodeci where a = "Sägen";
+--------+
| a      |
+--------+
| sagen  |
| sägen  |
+--------+

select * from test_bin where a = "Sägen";
Empty set (0.00 sec)
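
A possible middle ground (just a sketch against the test tables above): keep the insensitive collation on the column and override it per query with an explicit COLLATE clause, in either direction:

-- Accent-sensitive match on the ci table; should return only "sägen":
select * from test_unicodeci where a = "sägen" collate utf8_bin;

-- Case- and accent-insensitive match on the bin table; should return "sagen" and "sägen":
select * from test_bin where a collate utf8_general_ci = "Sägen";

Note that such per-query overrides typically prevent MySQL from using an index on the column.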

@reckart @betoboullosa thanks a lot for your comments and insights!

@betoboullosa IMO, case-sensitive searches make perfect sense for a lexical resource.

@judithek Yes, case sensitivity is OK for a lexical resource. It might be a problem when searching for words in sentences, for example.
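
If a case-insensitive but still accent-sensitive lookup is ever needed on a utf8_bin column, comparing lowercased values is one workaround (again only a sketch, with the caveat that the function call defeats index use):

-- Case-insensitive, accent-sensitive match on the bin table; should return only "sägen":
select * from test_bin where lower(a) = lower("Sägen");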