A problem to train a model for Russian language

Question

avostryakov opened this issue 7 years ago · 19 comments

I took 5 million lines from a Russian Wikipedia dump (extracted text) and created alphabet_ru.txt with the following content (all files are in UTF-8):
абвгдеёжзийклмнопрстуфхцчьъшщыэюя
I trained a model:
~/JamSpell/build$ ./main/jamspell train alphabet_ru.txt ~/Downloads/xaa model_wiki_ru.bin
[info] loading text
[info] generating N-grams 0
[info] processed 0%
[info] generating keys
[info] ngrams1: 1592588
[info] ngrams2: 27563594
[info] ngrams3: 57626371
[info] total: 86782553
[info] generating perf hash
[info] finished, buckets: 108478199
[info] buckets filled

It looks like the model was created without errors, but when I tried to correct misspelled words, it didn't work:

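import jamspell

corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('model_wiki_ru.bin')
# Neither misspelled word in this phrase gets corrected.
corrector.FixFragment(u'Папа пощел погуоять в метро.')

# No candidates are suggested: the call returns an empty tuple.
corrector.GetCandidates([u'погуоять'], 0)
()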

"пощел погуоять" weren't corrected! At the same time, your small model corrects these words!
I tried phrases with several completely corrupted words, with zero effect: no corrections, no suggestions.

Where is my mistake?
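
Before suspecting the model file, one quick check is whether the alphabet actually covers the words being tested and whether the alphabet file starts with a byte order mark. The sketch below assumes Python 3. The helper names `uncovered_letters` and `starts_with_bom` are made up for illustration and are not part of jamspell. It only rules out a bad alphabet file and says nothing about the model itself.

```python
import codecs

# The alphabet string from the question; in practice it would be read
# from alphabet_ru.txt in binary mode and decoded as UTF-8.
ALPHABET = "абвгдеёжзийклмнопрстуфхцчьъшщыэюя"


def uncovered_letters(text, alphabet):
    """Return the sorted letters of `text` that are missing from `alphabet`."""
    allowed = set(alphabet)
    return sorted({ch for ch in text.lower() if ch.isalpha() and ch not in allowed})


def starts_with_bom(raw_bytes):
    """Return True if the raw file content begins with a UTF-8 byte order mark."""
    return raw_bytes.startswith(codecs.BOM_UTF8)


if __name__ == "__main__":
    phrase = "Папа пощел погуоять в метро."
    print(uncovered_letters(phrase, ALPHABET))         # letters the alphabet lacks
    print(starts_with_bom(ALPHABET.encode("utf-8")))   # BOM check on the raw bytes
```

An empty list from `uncovered_letters` means every letter of the test phrase is in the alphabet, so the alphabet contents are unlikely to be the reason nothing is corrected.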

Answer 1 · 2018-01-30T11:25:29.000Z

Could you please upload your text file, your alphabet file, and the resulting model somewhere? I'll try to reproduce.

Answer 2 · 2018-01-30T12:48:45.000Z

The text file, the alphabet, and the resulting model:
https://yadi.sk/d/bR-lGoul3RvABF
https://yadi.sk/i/lhOR_cTx3RvACe
https://yadi.sk/d/24oGPHpz3RvAKK

Answer 3 · 2018-01-30T20:03:18.000Z

I tried to train a model and everything is OK. But your model is not working. What are your OS, architecture (32/64-bit), and compiler? It seems like there are some issues with model serialization; I tested on 64-bit Mac and Linux.

Answer 4 · 2018-01-30T21:13:25.000Z

Ubuntu 16.04 Desktop, 64-bit I think. python2 + virtualenv.

Answer 5 · 2018-01-30T21:37:37.000Z

Logs from compiling jamspell from source:

cmake ..
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find GTest (missing: GTEST_LIBRARY GTEST_INCLUDE_DIR GTEST_MAIN_LIBRARY)
-- Configuring done
-- Generating done
-- Build files have been written to: /home/antoly/JamSpell/build

/usr/bin/cc version:
COLLECT_GCC=/usr/bin/cc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.4' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)

Answer 6 · 2018-01-31T09:58:56.000Z

Thanks, I'll try to reproduce and fix it over the weekend. I've localized the problem (it is in how the alphabet is stored internally).

Answer 7 · 2018-01-31T10:34:19.000Z

By the way, jamspell doesn't compile under python3. Did you try it?

Answer 8 · 2018-02-04T20:02:47.000Z

Yep, it compiles. You need to update the repo; there was a fix in #2.

Answer 9 · 2018-02-04T20:36:56.000Z

Could you please try the following:

Update the code (I recently added a test for the Russian language), install jamspell for Python along with pytest, and run the following:
python2.7 -m pytest test_jamspell.py
Then let me know whether the tests passed or failed.
I was unable to find an environment where the issue reproduces (I checked locally on my Mac, on my remote Ubuntu 16 server, and on travis.ci; everything is OK everywhere). So could you please create a VirtualBox image where it reproduces and attach it here? Or maybe you can provide SSH access to your environment? My Skype: filippfg, you can add me there.

Answer 10 · 2018-02-05T07:06:45.000Z

I updated the code and recompiled jamspell from source. Here are the test results:

`python -m pytest test_jamspell.py
================================================================== test session starts ==================================================================
platform linux2 -- Python 2.7.12, pytest-3.4.0, py-1.5.2, pluggy-0.6.0
rootdir: /home/antoly/JamSpell, inifile:
collected 2 items

test_jamspell.py FF [100%]

======================================================================= FAILURES ========================================================================
_____________________________________________ test_evaluation[sherlockholmes.txt-alphabet_en.txt-expected0] _____________________________________________

sourceFile = 'sherlockholmes.txt', alphabetFile = 'test_data/alphabet_en.txt'
expected = (0.04519985057900635, 0.7005163511187608, 0.014246804944479363, 0.01363466567052671, 0.7676419965576592)

@pytest.mark.parametrize('sourceFile,alphabetFile,expected', [
    ('sherlockholmes.txt', 'alphabet_en.txt', (0.04519985057900635, 0.7005163511187608, 0.014246804944479363,
                                               0.01363466567052671, 0.7676419965576592)),
    ('kapitanskaya_dochka.txt', 'alphabet_ru.txt', (0.12330535829567463, 0.391304347826087, 0.03866565579984837,
                                                    0.05422853453841188, 0.4391304347826087)),
])
def test_evaluation(sourceFile, alphabetFile, expected):
    alphabetFile = TEST_DATA + alphabetFile
    generate_dataset.generateDatasetTxt(TEST_DATA + sourceFile, TEMP)
    trainLangModel(TEMP_TRAIN, alphabetFile, TEMP_MODEL)
    results = evaluateJamspell(TEMP_MODEL, TEMP_TEST, alphabetFile)

>       assert results == expected

E assert (0.1667911841...16137467, 0.0) == (0.04519985057...6419965576592)
E At index 0 diff: 0.16679118416137467 != 0.04519985057900635
E Use -v to get the full diff

test_jamspell.py:42: AssertionError
----------------------------------------------------------------- Captured stdout call ------------------------------------------------------------------
[info] removing duplicates
[info] 9986 left
[info] shuffling
[info] saving train set
[info] saving test set
[info] done
----------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------
[info] loading text
[info] generating N-grams 0
[info] generating keys
[info] ngrams1: 13077
[info] ngrams2: 55608
[info] ngrams3: 86329
[info] total: 155014
[info] generating perf hash
[info] finished, buckets: 193771
[info] buckets filled
__________________________________________ test_evaluation[kapitanskaya_dochka.txt-alphabet_ru.txt-expected1] ___________________________________________

sourceFile = 'kapitanskaya_dochka.txt', alphabetFile = 'test_data/alphabet_ru.txt'
expected = (0.12330535829567463, 0.391304347826087, 0.03866565579984837, 0.05422853453841188, 0.4391304347826087)

@pytest.mark.parametrize('sourceFile,alphabetFile,expected', [
    ('sherlockholmes.txt', 'alphabet_en.txt', (0.04519985057900635, 0.7005163511187608, 0.014246804944479363,
                                               0.01363466567052671, 0.7676419965576592)),
    ('kapitanskaya_dochka.txt', 'alphabet_ru.txt', (0.12330535829567463, 0.391304347826087, 0.03866565579984837,
                                                    0.05422853453841188, 0.4391304347826087)),
])
def test_evaluation(sourceFile, alphabetFile, expected):
    alphabetFile = TEST_DATA + alphabetFile
    generate_dataset.generateDatasetTxt(TEST_DATA + sourceFile, TEMP)
    trainLangModel(TEMP_TRAIN, alphabetFile, TEMP_MODEL)
    results = evaluateJamspell(TEMP_MODEL, TEMP_TEST, alphabetFile)

>       assert results == expected

E assert (0.2724338282...82763073, 0.0) == (0.12330535829...1304347826087)
E At index 0 diff: 0.2724338282763073 != 0.12330535829567463
E Use -v to get the full diff

test_jamspell.py:42: AssertionError
----------------------------------------------------------------- Captured stdout call ------------------------------------------------------------------
[info] removing duplicates
[info] 802 left
[info] shuffling
[info] saving train set
[info] saving test set
[info] done
----------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------
[info] loading text
[info] generating N-grams 0
[info] generating keys
[info] ngrams1: 8499
[info] ngrams2: 23588
[info] ngrams3: 28200
[info] total: 60287
[info] generating perf hash
[info] finished, buckets: 75367
[info] buckets filled`

Answer 11 · 2018-02-05T07:38:24.000Z

I installed all the latest Ubuntu updates and restarted the computer, then deleted, cloned, and compiled jamspell again. Same result as above.

Answer 12 · 2018-02-28T10:05:37.000Z

Same situation - training seems to end well, but the model isn't working
Any updates here?

Answer 13 · 2018-02-28T15:17:41.000Z

Sorry, I currently don't have any environment where it can be reproduced. Could you please prepare a VirtualBox image that reproduces this issue?

Answer 14 · 2018-03-01T13:17:55.000Z

OK, I'll try to do it over the weekend.

Maybe this will help you: after training via the ./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin command, I tried to run JamSpell right in the terminal.
I did this: ./main/jamspell correct model_sherlock.bin
It launched successfully, but all of the output was in hieroglyphs. Maybe there is a problem with encoding?

Answer 15 · 2018-03-01T14:04:16.000Z

I've just built your project in CLion and launched the training process there. It finished fine; now I have a working model, which I've already tested via the Python package.

I still think that the problem is somewhere in UTF8toWide, or vice versa.

Answer 16 · 2018-04-25T06:17:51.000Z

I have the same problem (hieroglyphs as output). @thelacker, is there a difference between building the library in CLion and with gcc directly?

Answer 17 · 2018-04-25T06:20:48.000Z

I trained the model using CLion with gcc as the compiler. I'm not sure that CLion is what makes the whole thing work correctly, but it worked in my case.

Answer 18 · 2018-04-25T06:39:52.000Z

Very strange. I have ubuntu 16.04, gcc 5.4.0
./build/main/jamspell train ./alphabet_en.txt sherlockholmes.txt model.bin
[info] loading text
[info] generating N-grams 0
[info] generating keys
[info] ngrams1: 10068
[info] ngrams2: 57804
[info] ngrams3: 93645
[info] total: 161517
[info] generating perf hash
[info] finished, buckets: 201907
[info] buckets filled

`./build/main/jamspell correct ./model.bin
[info] loading model
[info] loaded

hello how
栀攀氀氀漀栀漀眀`
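
A diagnostic observation on the garbled output, not a confirmed root cause: each "hieroglyph" is the code point of the corresponding ASCII letter with its two bytes swapped (for example, `h` is U+0068 and `栀` is U+6800). The sketch below assumes Python 3, and `swap_utf16_byte_order` is a hypothetical helper name. It shows that re-reading the 16-bit units in the opposite byte order recovers the input letters, with the space missing.

```python
def swap_utf16_byte_order(text):
    """Reinterpret each 16-bit code unit of `text` with its two bytes swapped."""
    return text.encode("utf-16-be").decode("utf-16-le")


garbled = "栀攀氀氀漀栀漀眀"  # output copied from the session above

# Swapping the byte order of every 16-bit unit yields plain ASCII letters.
print(swap_utf16_byte_order(garbled))

# The first two code points: 0x6800 and 0x6500, i.e. 'h' (0x68) and 'e' (0x65) shifted left by 8 bits.
print([hex(ord(ch)) for ch in garbled[:2]])
```

If this pattern holds for other inputs too, the conversion between UTF-8 and wide characters, or the way wide characters are written out, would be a reasonable place to look.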

Answer 19 · 2018-05-05T15:21:04.000Z

@bakwc Any ideas why this can happen? Thx in advance!