Conversion from MariaDB in encoding utf8mb4 to sqlite with utf8 produces invalid encoding output
Sahara150 opened this issue · 5 comments
When I convert my MySQL database to a SQLite Database, it produces some weird output and my IDE tells me, that this doesn´t seem to be UTF-8 for sure. The output begins like this:
SQLite format 3 � �� @ G �· 9 � � G .4 � ��ñ ���û�ö�ñ�ë .�Ç�ã�p�&�·�����*�" � g���=����indexfrench_index_French_typefrench�CREATE INDEX "french_index_French_type" ON "french" ("type")p���C����
And then continues with the text, having weird symbols from time to time.
Versions:
OS: Windows 10 Home
MariaDB: 10.4.11
Python: 3.8.2
MySQL-Encoding info from dump:
/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8mb4 */;
All the tables are in utf8mb4 encoding, since it also contains letters like é, ç, ā, ä, etc. (most of them are also part of utf8, but I am not sure about all of them)
Hey,
Thanx for the report.
Can you provide me with a partial dump of the database so that I can investigate, unless it's too sensitive data.
Also, can you please run mysql2sqlite --version
so that I can see all the versions of the dependencies?
On top of that, can you open the SQLite dump with a tool like DBeaver and check if it also shows weird encoding?
SQLite database dump:
SQLite format 3 � �� @ G �· 9 � � G .4 � ��ñ ���û�ö�ñ�ë .�Ç�ã�p�&�·�����*�" � g���=����indexfrench_index_French_typefrench�CREATE INDEX "french_index_French_type" ON "french" ("type")p���C����indexfrench_index_French_articlefrench�CREATE INDEX "french_index_French_article" ON "french" ("article")�/
������9tablefrenchfrench�CREATE TABLE "french" (
"id" INTEGER NOT NULL ,
"article" TEXT NULL ,
"word" TEXT NOT NULL ,
"type" INTEGER NOT NULL ,
PRIMARY KEY ("id"),
FOREIGN KEY("id") REFERENCES "word" ("id") ON UPDATE NO ACTION ON DELETE CASCADE,
FOREIGN KEY("id") REFERENCES "word" ("id") ON UPDATE NO ACTION ON DELETE CASCADE,
FOREIGN KEY("type") REFERENCES "wordtypes" ("id") ON UPDATE NO ACTION ON DELETE NO ACTION
)m��A����indexenglish_index_English_typeenglish
CREATE INDEX "english_index_English_type" ON "english" ("type")�G�������etableenglishenglishCREATE TABLE "english" (
"id" INTEGER NOT NULL ,
"word" TEXT NOT NULL ,
"type" INTEGER NOT NULL ,
PRIMARY KEY ("id"),
FOREIGN KEY("id") REFERENCES "word" ("id") ON UPDATE NO ACTION ON DELETE CASCADE,
FOREIGN KEY("type") REFERENCES "wordtypes" ("id") ON UPDATE NO ACTION ON DELETE NO ACTION
)q
��7%���indexdeclinations_strengthdeclinations�CREATE INDEX "declinations_strength" ON "declinations" ("strength")�a ����%��%indexdeclinations_index_Declinations_genus_case_strength_numerusdeclinations
CREATE UNIQUE INDEX "declinations_index_Declinations_genus_case_strength_numerus" ON "declinations" ("genus", "case", "strength", "numerus")e���/%���indexdeclinations_casedeclinations CREATE INDEX "declinations_case" ON "declinations" ("case")�:���%%��7tabledeclinationsdeclinations�CREATE TABLE "declinations" (
"id" INTEGER NOT NULL ,
"genus" INTEGER NOT NULL ,
"case" INTEGER NOT NULL ,
"strength" INTEGER NOT NULL ,
"numerus" INTEGER NOT NULL DEFAULT '0' ,
"ending" TEXT NOT NULL ,
"alternativeEnding" TEXT NOT NULL ,
"lettersWithAlternative" TEXT NOT NULL ,
"lettersWithoutEnding" TEXT NOT NULL ,
"article" TEXT NOT NULL ,
PRIMARY KEY ("id"),
FOREIGN KEY("genus") REFERENCES "articles" ("id") ON UPDATE NO ACTION ON DELETE NO ACTION,
FOREIGN KEY("case") REFERENCES "cases" ("id") ON UPDATE NO ACTION ON DELETE NO ACTION,
FOREIGN KEY("strength") REFERENCES "strengths" ("id") ON UPDATE NO ACTION ON DELETE NO ACTION
)z�������Stablecasescases�CREATE TABLE "cases" (
"id" INTEGER NOT NULL ,
"caseName" TEXT NOT NULL ,
PRIMARY KEY ("id")
)Z���+��{indexbeorhave_wordIdbeorhave�CREATE INDEX "beorhave_wordId" ON "beorhave" ("wordId")�a��������tablebeorhavebeorhave�CREATE TABLE "beorhave" (
"entryId" INTEGER NOT NULL ,
"wordId" INTEGER NOT NULL ,
"person" INTEGER NOT NULL ,
"numerus" INTEGER NOT NULL ,
"word" TEXT NOT NULL ,
"transitive" INTEGER NOT NULL ,
PRIMARY KEY ("entryId"),
FOREIGN KEY("wordId") REFERENCES "german" ("id") ON UPDATE NO ACTION ON DELETE NO ACTION
)�����W���Cindexarticles_index_Articles_germanArticlearticles�CREATE UNIQUE INDEX "articles_index_Articles_germanArticle" ON "articles" ("germanArticle")�����W���Cindexarticles_index_Articles_frenchArticlearticles�CREATE UNIQUE INDEX "articles_index_Articles_frenchArticle" ON "articles" ("frenchArticle")�&��������tablearticlesarticles�CREATE TABLE "articles" (
"id" INTEGER NOT NULL ,
"germanArticle" TEXT NOT NULL ,
"frenchArticle" TEXT NULL ,
PRIMA ��+ �� ��
��Ô �õ�ê�á�Ô ��� � der/die��� � das �� ��diela �� ��derle
��é �î�é�ó�ú �� ���� ������la���� le
��Ý �é�ù�Ý�ñ ����der/die�����das�����die���� der
�
�
��Ê �ñ�ä�Ù�Ê
�� !accusative �� �dative��� �genitive
�� !nominative
...
MySQL Database Dump:
-- phpMyAdmin SQL Dump
-- version 5.0.2
-- https://www.phpmyadmin.net/
--
-- Host: 127.0.0.1
-- Erstellungszeit: 02. Aug 2021 um 10:03
-- Server-Version: 10.4.11-MariaDB
-- PHP-Version: 7.4.4
SET SQL_MODE = "NO_AUTO_VALUE_ON_ZERO";
START TRANSACTION;
SET time_zone = "+00:00";
/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8mb4 */;
--
-- Datenbank: `dictionarydb`
--
-- --------------------------------------------------------
--
-- Tabellenstruktur für Tabelle `articles`
--
CREATE TABLE `articles` (
`id` int(11) NOT NULL,
`germanArticle` text NOT NULL,
`frenchArticle` text DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
--
-- Daten für Tabelle `articles`
--
INSERT INTO `articles` (`id`, `germanArticle`, `frenchArticle`) VALUES
(1, 'der', 'le'),
(2, 'die', 'la'),
(3, 'das', NULL),
(4, 'der/die', NULL);
-- --------------------------------------------------------
--
-- Tabellenstruktur für Tabelle `beorhave`
--
CREATE TABLE `beorhave` (
`entryId` int(11) NOT NULL,
`wordId` int(11) NOT NULL,
`person` int(11) NOT NULL,
`numerus` int(11) NOT NULL,
`word` text NOT NULL,
`transitive` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
-- --------------------------------------------------------
--
-- Tabellenstruktur für Tabelle `cases`
--
CREATE TABLE `cases` (
`id` int(11) NOT NULL,
`caseName` text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
--
-- Daten für Tabelle `cases`
--
INSERT INTO `cases` (`id`, `caseName`) VALUES
(1, 'nominative'),
(2, 'genitive'),
(3, 'dative'),
(4, 'accusative');
-- --------------------------------------------------------
--
-- Tabellenstruktur für Tabelle `declinations`
--
CREATE TABLE `declinations` (
`id` int(11) NOT NULL,
`genus` int(11) NOT NULL,
`case` int(11) NOT NULL,
`strength` int(11) NOT NULL,
`numerus` int(11) NOT NULL DEFAULT 0,
`ending` text NOT NULL,
`alternativeEnding` text NOT NULL,
`lettersWithAlternative` text NOT NULL,
`lettersWithoutEnding` text NOT NULL,
`article` text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
...
Output from mysql2sqlite --version:
| software | version |
|------------------------|--------------------------------------|
| mysql-to-sqlite3 | 1.4.4 |
| | |
| Operating System | Windows 10 |
| Python | CPython 3.8.2 |
| MySQL | MySQL client not found on the system |
| SQLite | 3.28.0 |
| | |
| click | 8.0.1 |
| mysql-connector-python | 8.0.26 |
| python-slugify | 5.0.2 |
| pytimeparse | 1.1.8 |
| simplejson | 3.17.3 |
| six | 1.16.0 |
| tabulate | 0.8.9 |
| tqdm | 4.62.0 |
MariaDB is part of shipped Apache server, I guess that´s why it´s not able to show the version. (My xampp folder is not part of the programs folder) However since it generates a file of the db at all, I don´t think that´s an issue.
Apache versions:
+ Apache 2.4.43
+ MariaDB 10.4.11
+ PHP 7.4.4 (VC15 X86 64bit thread safe) + PEAR
+ phpMyAdmin 5.0.2
+ OpenSSL 1.1.0g
+ ADOdb 518a
+ Mercury Mail Transport System v4.63 (not included in the portable version)
+ FileZilla FTP Server 0.9.41 (not included in the portable version)
+ Webalizer 2.23-04 (not included in the portable version)
+ Strawberry Perl 5.16.3.1 Portable
+ Tomcat 7.0.103
+ XAMPP Control Panel Version 3.2.4.
+ XAMPP mailToDisk 1.0 (write emails via PHP on local disk in <xampp>\mailoutput. Activated in the php.ini as mail default.)
Unfortunately I can´t download the tool in a reasonable time, since I am in a country with veeeery bad internet. Right now it´s showing 16hours download time for 90MB. I´m still downloading it, but I won´t know the result today.
Note: I just found out, that apparently the utf8mb4 encoding of MySQL is the real utf8 encoding, since their "utf-8" encoding only supports 3 bytes, not 4 as the standard should. Which makes it even more important, that the conversion in that encoding works and even more weird, that it doesn´t seem to do so. (I would expect that there´s no additional code needed for that, since it actually is the same encoding, just a different name. :O)
Wait what do you mean by "SQLite database dump"?
I think you missunderstand what that file is.
Try opening it with sqlite.exe
or wait for DBeaver to finish downloading (so sorry to hear about your internet problems - reminds me of the 90s and 56k dial-up when everything was painfully slow
Yea, utf8mb4 basically added support for storing emojis in the database.
Oooooh, so it´s not meant to be a .sql file, but a .db file? :OOO
That would explain a lot. :D
Alright, I will try to open it with DBeaver, when it finally finished downloading.
Yep, the internet connection is indeed in the state of the 90s. :D But you really learn a lot about data optimization, when you are coding here. ;)
Hehe, yes, that's not a dump file but an actual working database
Let me know if you still have problems once you have opened it in a database viewer, like DBeaver etc.