bakwc/JamSpell

Problem training a model for the Russian language

avostryakov opened this issue · 19 comments

  1. I took 5 million lines of extracted text from a Russian Wikipedia dump and created alphabet_ru.txt with the following content (all files are in UTF-8):
    абвгдеёжзийклмнопрстуфхцчьъшщыэюя

  2. I trained a model:
    ~/JamSpell/build$ ./main/jamspell train alphabet_ru.txt ~/Downloads/xaa model_wiki_ru.bin
    [info] loading text
    [info] generating N-grams 0
    [info] processed 0%
    [info] generating keys
    [info] ngrams1: 1592588
    [info] ngrams2: 27563594
    [info] ngrams3: 57626371
    [info] total: 86782553
    [info] generating perf hash
    [info] finished, buckets: 108478199
    [info] buckets filled
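The counts in this log can be cross-checked with a few lines of Python. This is only a sketch: the log excerpt is pasted from the run above, and `parse_counts` is a helper name made up here, not part of JamSpell. It checks that the three n-gram counts add up to the reported total and prints the bucket-to-total ratio.

```python
import re

# Excerpt of the training log quoted above.
LOG = """\
[info] ngrams1: 1592588
[info] ngrams2: 27563594
[info] ngrams3: 57626371
[info] total: 86782553
[info] finished, buckets: 108478199
"""

def parse_counts(log):
    """Collect every 'name: number' pair from the log into a dict (hypothetical helper)."""
    return {name: int(value) for name, value in re.findall(r"(\w+): (\d+)", log)}

counts = parse_counts(LOG)
ngram_sum = counts["ngrams1"] + counts["ngrams2"] + counts["ngrams3"]

# True when the per-order counts add up to the reported total.
print(ngram_sum == counts["total"])
# Buckets per n-gram, rounded to three decimals.
print(round(counts["buckets"] / counts["total"], 3))
```

For this log the sum matches the total and the ratio is about 1.25. That only says the training log is internally consistent; it says nothing about whether the saved model can be read back correctly.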

Training appears to finish without errors, but when I try to correct misspelled words, nothing happens:

import jamspell
corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('model_wiki_ru.bin')
corrector.FixFragment(u'Папа пощел погуоять в метро.')

corrector.GetCandidates([u'погуоять'], 0)
()

"пощел погуоять" weren't corrected, while your small model does correct these words.
I also tried phrases with several completely corrupted words, with zero effect: no corrections, no suggestions.

Where is my mistake?
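One thing that can be ruled out quickly is the content of the alphabet itself. The sketch below uses only the alphabet string and the sample sentence quoted above; `uncovered` is a helper name invented for illustration, not a JamSpell function. It reports any letters of the sentence that the alphabet does not contain.

```python
# Alphabet line and test sentence copied from the report above.
ALPHABET = "абвгдеёжзийклмнопрстуфхцчьъшщыэюя"
SAMPLE = "Папа пощел погуоять в метро."

def uncovered(text, alphabet):
    """Return the sorted letters of `text` (lowercased) that are absent from `alphabet`."""
    letters = {ch for ch in text.lower() if ch.isalpha()}
    return sorted(letters - set(alphabet))

# Number of characters in the alphabet and number of distinct ones.
print(len(ALPHABET), len(set(ALPHABET)))
# Letters of the sample sentence that the alphabet does not cover.
print(uncovered(SAMPLE, ALPHABET))
```

The alphabet has 33 distinct letters and covers every letter of the sample sentence, so a missing character in alphabet_ru.txt is unlikely to explain the empty candidate list. This does not check how the file is read or stored on disk (for example, a byte order mark or a different encoding would not show up here).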

bakwc commented

Could you please upload your text file, your alphabet file, and the resulting model somewhere? I'll try to reproduce.

bakwc commented

I tried to train a model and everything is OK, but your model is not working. What is your OS (32- or 64-bit) and compiler? It seems there is an issue with model serialization; I tested on 64-bit Mac and Linux.

Ubuntu 16.04 Desktop, 64-bit I think. Python 2 + virtualenv.

Logs from compiling jamspell from source:

cmake ..
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find GTest (missing: GTEST_LIBRARY GTEST_INCLUDE_DIR GTEST_MAIN_LIBRARY)
-- Configuring done
-- Generating done
-- Build files have been written to: /home/antoly/JamSpell/build

/usr/bin/cc version:
COLLECT_GCC=/usr/bin/cc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.4' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)

bakwc commented

Thanks, I'll try to reproduce and fix it over the weekend. I have localized the problem: it is in how the alphabet is stored internally.

By the way, jamspell doesn't compile under Python 3. Have you tried it?

bakwc commented

Yep, it compiles. You need to update the repo; there was a fix in #2.

bakwc commented

Could you please try the following:

  1. Update the code (I recently added a test for the Russian language), install jamspell for Python and pytest, and run:
    python2.7 -m pytest test_jamspell.py
    Then let me know whether the test passed or failed.

  2. I was unable to find an environment where this reproduces (I checked locally on my Mac, on my remote Ubuntu 16 server, and on Travis CI; it works everywhere). Could you please create a VirtualBox image where it reproduces and attach it here? Or maybe you could provide SSH access to your environment? My Skype: filippfg, you can add me there.

I updated the code and recompiled jamspell from source. Here are the test results:

`python -m pytest test_jamspell.py
================================================================== test session starts ==================================================================
platform linux2 -- Python 2.7.12, pytest-3.4.0, py-1.5.2, pluggy-0.6.0
rootdir: /home/antoly/JamSpell, inifile:
collected 2 items

test_jamspell.py FF [100%]

======================================================================= FAILURES ========================================================================
_____________________________________________ test_evaluation[sherlockholmes.txt-alphabet_en.txt-expected0] _____________________________________________

sourceFile = 'sherlockholmes.txt', alphabetFile = 'test_data/alphabet_en.txt'
expected = (0.04519985057900635, 0.7005163511187608, 0.014246804944479363, 0.01363466567052671, 0.7676419965576592)

@pytest.mark.parametrize('sourceFile,alphabetFile,expected', [
    ('sherlockholmes.txt', 'alphabet_en.txt', (0.04519985057900635, 0.7005163511187608, 0.014246804944479363,
                                               0.01363466567052671, 0.7676419965576592)),
    ('kapitanskaya_dochka.txt', 'alphabet_ru.txt', (0.12330535829567463, 0.391304347826087, 0.03866565579984837,
                                                    0.05422853453841188, 0.4391304347826087)),
])
def test_evaluation(sourceFile, alphabetFile, expected):
    alphabetFile = TEST_DATA + alphabetFile
    generate_dataset.generateDatasetTxt(TEST_DATA + sourceFile, TEMP)
    trainLangModel(TEMP_TRAIN, alphabetFile, TEMP_MODEL)
    results = evaluateJamspell(TEMP_MODEL, TEMP_TEST, alphabetFile)
  assert results == expected

E assert (0.1667911841...16137467, 0.0) == (0.04519985057...6419965576592)
E At index 0 diff: 0.16679118416137467 != 0.04519985057900635
E Use -v to get the full diff

test_jamspell.py:42: AssertionError
----------------------------------------------------------------- Captured stdout call ------------------------------------------------------------------
[info] removing duplicates
[info] 9986 left
[info] shuffling
[info] saving train set
[info] saving test set
[info] done
----------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------
[info] loading text
[info] generating N-grams 0
[info] generating keys
[info] ngrams1: 13077
[info] ngrams2: 55608
[info] ngrams3: 86329
[info] total: 155014
[info] generating perf hash
[info] finished, buckets: 193771
[info] buckets filled
__________________________________________ test_evaluation[kapitanskaya_dochka.txt-alphabet_ru.txt-expected1] ___________________________________________

sourceFile = 'kapitanskaya_dochka.txt', alphabetFile = 'test_data/alphabet_ru.txt'
expected = (0.12330535829567463, 0.391304347826087, 0.03866565579984837, 0.05422853453841188, 0.4391304347826087)

@pytest.mark.parametrize('sourceFile,alphabetFile,expected', [
    ('sherlockholmes.txt', 'alphabet_en.txt', (0.04519985057900635, 0.7005163511187608, 0.014246804944479363,
                                               0.01363466567052671, 0.7676419965576592)),
    ('kapitanskaya_dochka.txt', 'alphabet_ru.txt', (0.12330535829567463, 0.391304347826087, 0.03866565579984837,
                                                    0.05422853453841188, 0.4391304347826087)),
])
def test_evaluation(sourceFile, alphabetFile, expected):
    alphabetFile = TEST_DATA + alphabetFile
    generate_dataset.generateDatasetTxt(TEST_DATA + sourceFile, TEMP)
    trainLangModel(TEMP_TRAIN, alphabetFile, TEMP_MODEL)
    results = evaluateJamspell(TEMP_MODEL, TEMP_TEST, alphabetFile)
  assert results == expected

E assert (0.2724338282...82763073, 0.0) == (0.12330535829...1304347826087)
E At index 0 diff: 0.2724338282763073 != 0.12330535829567463
E Use -v to get the full diff

test_jamspell.py:42: AssertionError
----------------------------------------------------------------- Captured stdout call ------------------------------------------------------------------
[info] removing duplicates
[info] 802 left
[info] shuffling
[info] saving train set
[info] saving test set
[info] done
----------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------
[info] loading text
[info] generating N-grams 0
[info] generating keys
[info] ngrams1: 8499
[info] ngrams2: 23588
[info] ngrams3: 28200
[info] total: 60287
[info] generating perf hash
[info] finished, buckets: 75367
[info] buckets filled`

I installed all the latest Ubuntu updates, restarted the computer, then deleted, cloned, and compiled jamspell again. Same result as above.

Same situation here: training seems to finish fine, but the model isn't working.
Any updates here?

bakwc commented

Sorry, I currently don't have any environment where it can be reproduced. Could you please prepare a VirtualBox image that reproduces this issue?

OK, I'll try to do it over the weekend.

Maybe this will help: after training with the command ./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin, I tried to run JamSpell right in the terminal.
I did this: ./main/jamspell correct model_sherlock.bin
It launched successfully, but all of the output was in hieroglyphs. Maybe there is a problem with encoding?

I've just built your project in CLion and ran the training process there. It finished fine: I now have a working model, which I have already tested via the Python package.

I still think that the problem is somewhere in UTF8toWide, or vice versa.

I also have the same problem (hieroglyphs as output). @thelacker, is there a difference between building the library in CLion and with gcc?

I trained the model using CLion with gcc as the compiler. I'm not sure that CLion is what makes the whole thing work correctly, but it worked in my case.

Very strange. I have ubuntu 16.04, gcc 5.4.0
./build/main/jamspell train ./alphabet_en.txt sherlockholmes.txt model.bin
[info] loading text
[info] generating N-grams 0
[info] generating keys
[info] ngrams1: 10068
[info] ngrams2: 57804
[info] ngrams3: 93645
[info] total: 161517
[info] generating perf hash
[info] finished, buckets: 201907
[info] buckets filled

`./build/main/jamspell correct ./model.bin
[info] loading model
[info] loaded

hello how
栀攀氀氀漀 栀漀眀`

@bakwc Any ideas why this can happen? Thx in advance!
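
The garbled output above has a regular shape that is easy to check: each character looks like the typed ASCII character with its code point shifted left by 8 bits (for example `h` is U+0068 and `栀` is U+6800). The sketch below reproduces that mapping in Python. The function names are made up for illustration, and this is one reading of the symptom, not a confirmed diagnosis of JamSpell's code.

```python
def shift_left_8(text):
    """Move each code point's low byte into the high byte (0x0068 -> 0x6800)."""
    return "".join(chr(ord(ch) << 8) for ch in text)

def shift_right_8(text):
    """Undo the shift for code points above 0xFF whose low byte is zero; keep the rest."""
    out = []
    for ch in text:
        cp = ord(ch)
        out.append(chr(cp >> 8) if cp > 0xFF and cp & 0xFF == 0 else ch)
    return "".join(out)

typed = "hello how"
garbled = shift_left_8(typed)

# The same string results from writing UTF-16 little-endian
# and reading the bytes back as UTF-16 big-endian.
via_codecs = typed.encode("utf-16-le").decode("utf-16-be")

print(garbled)
print(garbled == via_codecs)
print(shift_right_8(garbled))
```

Here `garbled` and `via_codecs` are equal, and shifting back recovers `hello how`. That is the pattern a 16-bit byte-order mix-up would produce, or any step that places a character's single byte in the high half of a wide character. It does not show whether such a step happens in the UTF-8/wide-character conversion, in model serialization, or in the build and terminal environment.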