taku910/mecab

matrix right/left dimension checking is inconsistent (compiling user dictionary/assigning user dict costs)

Opened this issue · 3 comments

with unidic as downloaded:

$ mecab/bin/mecab-dict-index.exe -m mecab/dic/unidic-csj/model.bin -d mecab/dic/unidic-csj/ -f utf-8 -c utf-8 -a userdict.csv -u user2.csv
dictionary.cpp(183) [cid.left_size() == matrix.left_size() && cid.right_size() == matrix.right_size()] Context ID files(mecab/dic/unidic-csj/\left-id.def or mecab/dic/unidic-csj/\right-id.def may be broken: 7074 8407 8407 7074

after swapping right-id and left-id files:

$ mecab/bin/mecab-dict-index.exe -m mecab/dic/unidic-csj/model.bin -d mecab/dic/unidic-csj/ -f utf-8 -c utf-8 -a userdict.csv -u user2.csv
reading userdict.csv ... dictionary.cpp(212) [lid >= 0 && rid >= 0 && matrix.is_valid(lid, rid)] invalid ids are found lid=7151 rid=6170

swapping matrix.left_size() and matrix.right_size() in the assertions seems to fix it without breaking left/right id generation

however, building on my toolchain (mingw-w64 32-bit) was difficult, because the linker can't find wmain; I had to change winmain.h like so:

int main_wrapper(int argc, char **argv);

int main(int argc, char ** argv) {
    int argcount;
    LPWSTR commandline = GetCommandLineW();
    LPWSTR * arglist = CommandLineToArgvW(commandline, &argcount);
    
    CommandLine cmd(argcount, arglist);
    return main_wrapper(cmd.argc(), cmd.argv());
}
#define main(argc, argv) main_wrapper(argc, argv)

and remove -municode from configure

the diff I used to work around this bug is basically:

diff a/mecab/src/dictionary.cpp b/mecab/src/dictionary.cpp
--- a/mecab/src/dictionary.cpp
+++ b/mecab/src/dictionary.cpp
@@ -179,14 +179,14 @@ bool Dictionary::assignUserDictionaryCosts(

   cid.open(left_id_file.c_str(),
            right_id_file.c_str(), &config_iconv);
-  CHECK_DIE(cid.left_size()  == matrix.left_size() &&
-            cid.right_size() == matrix.right_size())
+  CHECK_DIE(cid.left_size()  == matrix.right_size() &&
+            cid.right_size() == matrix.left_size())
       << "Context ID files("
       << left_id_file
       << " or "
       << right_id_file << " may be broken: "
-      << cid.left_size() << " " << matrix.left_size() << " "
-      << cid.right_size() << " " << matrix.right_size();
+      << cid.left_size() << " " << matrix.right_size() << " "
+      << cid.right_size() << " " << matrix.left_size();

   std::ofstream ofs(output);
   CHECK_DIE(ofs) << "permission denied: " << output;
@@ -352,8 +352,8 @@ bool Dictionary::compile(const Param &param,
           cid.reset(new ContextID);
           cid->open(left_id_file.c_str(),
                     right_id_file.c_str(), &config_iconv);
-          CHECK_DIE(cid->left_size()  == matrix.left_size() &&
-                    cid->right_size() == matrix.right_size())
+          CHECK_DIE(cid->left_size()  == matrix.right_size() &&
+                    cid->right_size() == matrix.left_size())
               << "Context ID files("
               << left_id_file
               << " or "

Connector::open() checks for the "left" count at 0x0 into matrix.bin, and for the "right" count at 0x2, but the numbers and filesizes in unidic 2.3.0 are consistent with the "right" count coming first. (i.e. the first short in matrix.bin is larger than the second, but right-id.def is larger than and has more lines than left-id.def).

What's at fault here? The ContextID files do not appear to be named incorrectly (see the opening post of this issue - swapping their names results in an error in cost recalculation), and again, everything works just fine when a user dictionary is not being used.

lsize_ = static_cast<unsigned short>((*cmmap_)[0]);

explorer_2018-12-18_05-33-08

explorer_2018-12-18_05-33-25