Problem training a model for the Russian language
avostryakov opened this issue · 19 comments
- I took 5 million lines from a Russian Wikipedia dump (extracted text) and created alphabet_ru.txt with the following content (all files are in UTF-8):
абвгдеёжзийклмнопрстуфхцчьъшщыэюя
- I trained a model:
~/JamSpell/build$ ./main/jamspell train alphabet_ru.txt ~/Downloads/xaa model_wiki_ru.bin
[info] loading text
[info] generating N-grams 0
[info] processed 0%
[info] generating keys
[info] ngrams1: 1592588
[info] ngrams2: 27563594
[info] ngrams3: 57626371
[info] total: 86782553
[info] generating perf hash
[info] finished, buckets: 108478199
[info] buckets filled
It looks like the model was created without errors, but when I tried to correct misspelled words, it didn't work:
import jamspell
corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('model_wiki_ru.bin')
corrector.FixFragment(u'Папа пощел погуоять в метро.')
corrector.GetCandidates([u'погуоять'], 0)
()
"пощел погуоять" were not corrected! At the same time, your small model corrects these words!
I tried phrases with several completely corrupted words, with zero effect: no corrections, no suggestions.
Where is my mistake?
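One way to rule out the alphabet file itself before digging further is a quick sanity check like the one below. This is only a minimal sketch in Python 3 (the session above uses Python 2); `check_alphabet` and `RUSSIAN_LETTERS` are hypothetical helper names, not part of jamspell, and the check assumes the file is expected to hold exactly the 33 lowercase Russian letters.

```python
import os
import tempfile

# Reference set: the 33 lowercase letters of the Russian alphabet.
RUSSIAN_LETTERS = set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")


def check_alphabet(path):
    """Return (letters, missing, extra) for an alphabet file.

    Hypothetical helper, not part of jamspell. Raises UnicodeDecodeError
    if the file is not valid UTF-8.
    """
    with open(path, "rb") as f:
        raw = f.read()
    # "utf-8-sig" also drops a leading byte order mark if an editor added one.
    text = raw.decode("utf-8-sig")
    letters = [ch for ch in text if not ch.isspace()]
    missing = sorted(RUSSIAN_LETTERS - set(letters))
    extra = sorted(set(letters) - RUSSIAN_LETTERS)
    return letters, missing, extra


# Demo on a temporary file holding the alphabet string from this post.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("абвгдеёжзийклмнопрстуфхцчьъшщыэюя\n")
letters, missing, extra = check_alphabet(tmp.name)
os.remove(tmp.name)
print(len(letters), "letters; missing:", missing, "extra:", extra)
```

If this reports no missing or extra letters and no decoding error, the alphabet file is probably not where the problem lies.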
Could you please upload your text file, your alphabet file, and the resulting model somewhere? I'll try to reproduce it.
The text file, the alphabet, and the resulting model:
https://yadi.sk/d/bR-lGoul3RvABF
https://yadi.sk/i/lhOR_cTx3RvACe
https://yadi.sk/d/24oGPHpz3RvAKK
I tried to train a model and everything is OK, but your model is not working. What is your OS (32- or 64-bit) and compiler? It seems there are some issues with model serialization; I tested on 64-bit Mac and Linux.
Ubuntu 16.04 Desktop. 64bit I think. python2 + virtual env.
Logs when I compiled jamspell from source code:
cmake ..
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find GTest (missing: GTEST_LIBRARY GTEST_INCLUDE_DIR GTEST_MAIN_LIBRARY)
-- Configuring done
-- Generating done
-- Build files have been written to: /home/antoly/JamSpell/build
/usr/bin/cc version:
COLLECT_GCC=/usr/bin/cc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.0-6ubuntu1 16.04.4 --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
Thanks, I'll try to reproduce and fix it over the weekend. I have localized the problem (it is in how the alphabet is stored internally).
By the way, jamspell does not compile under python3. Did you try it?
Could you please try the following:
- Update the code (I recently added a test for the Russian language), install jamspell for Python along with pytest, and run:
python2.7 -m pytest test_jamspell.py
Then let me know whether the test passed or failed.
- I was unable to find an environment where it reproduces (I checked locally on my Mac, on my remote Ubuntu 16 server, and on Travis CI; everything works everywhere). So could you please create a VirtualBox image where it reproduces and attach it here? Or maybe you can provide SSH access to your environment? My Skype: filippfg, you can add me there.
I updated the code and recompiled jamspell from source. Here are the test results:
`python -m pytest test_jamspell.py
================================================================== test session starts ==================================================================
platform linux2 -- Python 2.7.12, pytest-3.4.0, py-1.5.2, pluggy-0.6.0
rootdir: /home/antoly/JamSpell, inifile:
collected 2 items
test_jamspell.py FF [100%]
======================================================================= FAILURES ========================================================================
_____________________________________________ test_evaluation[sherlockholmes.txt-alphabet_en.txt-expected0] _____________________________________________
sourceFile = 'sherlockholmes.txt', alphabetFile = 'test_data/alphabet_en.txt'
expected = (0.04519985057900635, 0.7005163511187608, 0.014246804944479363, 0.01363466567052671, 0.7676419965576592)
@pytest.mark.parametrize('sourceFile,alphabetFile,expected', [
('sherlockholmes.txt', 'alphabet_en.txt', (0.04519985057900635, 0.7005163511187608, 0.014246804944479363,
0.01363466567052671, 0.7676419965576592)),
('kapitanskaya_dochka.txt', 'alphabet_ru.txt', (0.12330535829567463, 0.391304347826087, 0.03866565579984837,
0.05422853453841188, 0.4391304347826087)),
])
def test_evaluation(sourceFile, alphabetFile, expected):
alphabetFile = TEST_DATA + alphabetFile
generate_dataset.generateDatasetTxt(TEST_DATA + sourceFile, TEMP)
trainLangModel(TEMP_TRAIN, alphabetFile, TEMP_MODEL)
results = evaluateJamspell(TEMP_MODEL, TEMP_TEST, alphabetFile)
assert results == expected
E assert (0.1667911841...16137467, 0.0) == (0.04519985057...6419965576592)
E At index 0 diff: 0.16679118416137467 != 0.04519985057900635
E Use -v to get the full diff
test_jamspell.py:42: AssertionError
----------------------------------------------------------------- Captured stdout call ------------------------------------------------------------------
[info] removing duplicates
[info] 9986 left
[info] shuffling
[info] saving train set
[info] saving test set
[info] done
----------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------
[info] loading text
[info] generating N-grams 0
[info] generating keys
[info] ngrams1: 13077
[info] ngrams2: 55608
[info] ngrams3: 86329
[info] total: 155014
[info] generating perf hash
[info] finished, buckets: 193771
[info] buckets filled
__________________________________________ test_evaluation[kapitanskaya_dochka.txt-alphabet_ru.txt-expected1] ___________________________________________
sourceFile = 'kapitanskaya_dochka.txt', alphabetFile = 'test_data/alphabet_ru.txt'
expected = (0.12330535829567463, 0.391304347826087, 0.03866565579984837, 0.05422853453841188, 0.4391304347826087)
@pytest.mark.parametrize('sourceFile,alphabetFile,expected', [
('sherlockholmes.txt', 'alphabet_en.txt', (0.04519985057900635, 0.7005163511187608, 0.014246804944479363,
0.01363466567052671, 0.7676419965576592)),
('kapitanskaya_dochka.txt', 'alphabet_ru.txt', (0.12330535829567463, 0.391304347826087, 0.03866565579984837,
0.05422853453841188, 0.4391304347826087)),
])
def test_evaluation(sourceFile, alphabetFile, expected):
alphabetFile = TEST_DATA + alphabetFile
generate_dataset.generateDatasetTxt(TEST_DATA + sourceFile, TEMP)
trainLangModel(TEMP_TRAIN, alphabetFile, TEMP_MODEL)
results = evaluateJamspell(TEMP_MODEL, TEMP_TEST, alphabetFile)
assert results == expected
E assert (0.2724338282...82763073, 0.0) == (0.12330535829...1304347826087)
E At index 0 diff: 0.2724338282763073 != 0.12330535829567463
E Use -v to get the full diff
test_jamspell.py:42: AssertionError
----------------------------------------------------------------- Captured stdout call ------------------------------------------------------------------
[info] removing duplicates
[info] 802 left
[info] shuffling
[info] saving train set
[info] saving test set
[info] done
----------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------
[info] loading text
[info] generating N-grams 0
[info] generating keys
[info] ngrams1: 8499
[info] ngrams2: 23588
[info] ngrams3: 28200
[info] total: 60287
[info] generating perf hash
[info] finished, buckets: 75367
[info] buckets filled`
I installed all the latest Ubuntu updates and restarted the computer, then deleted, cloned, and compiled jamspell again. The result is the same as above.
Same situation - training seems to end well, but the model isn't working
Any updates here?
Sorry, I currently don't have any environment where it can be reproduced. Could you please prepare a VirtualBox image that reproduces this issue?
OK, I'll try to do it over the weekend.
Maybe this will help: after training with the command
./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin
I tried to run JamSpell directly in the terminal:
./main/jamspell correct model_sherlock.bin
It launched successfully, but all of its output was in hieroglyphs. Maybe there is a problem with encoding?
I've just built your project in CLion and launched the training process there. It finished fine, and now I have a working model, which I have already tested via the Python package.
I still think that the problem is somewhere in UTF8toWide, or vice versa.
I also have the same problem (hieroglyphs as output), @thelacker is there a difference when building the library in CLion and gcc?
I trained the model using CLion with gcc as the compiler. I'm not sure that CLion is what makes the whole thing work correctly, but it worked in my case.
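Since the same sources gave a working model from CLion but not from a plain terminal, one thing that could be compared between the two environments is the locale. This is only a guess; nothing in this thread confirms that the locale is the cause. A minimal Python 3 sketch (`locale_report` is a hypothetical helper name) that can be run from both places and diffed:

```python
import locale
import os
import sys


def locale_report():
    """Collect locale settings that commonly affect text conversion."""
    # Environment variables that select the process locale on Linux.
    report = {name: os.environ.get(name) for name in ("LANG", "LC_ALL", "LC_CTYPE")}
    # Encoding Python derives from the current locale settings.
    report["preferred_encoding"] = locale.getpreferredencoding(False)
    # Encoding used when writing to the terminal (may be None if redirected).
    report["stdout_encoding"] = sys.stdout.encoding
    return report


for key, value in sorted(locale_report().items()):
    print("{}: {}".format(key, value))
```

If the two reports differ (for example, a UTF-8 locale in one and `C`/`POSIX` in the other), that would be worth mentioning here.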
Very strange. I have ubuntu 16.04, gcc 5.4.0
./build/main/jamspell train ./alphabet_en.txt sherlockholmes.txt model.bin
[info] loading text
[info] generating N-grams 0
[info] generating keys
[info] ngrams1: 10068
[info] ngrams2: 57804
[info] ngrams3: 93645
[info] total: 161517
[info] generating perf hash
[info] finished, buckets: 201907
[info] buckets filled
`./build/main/jamspell correct ./model.bin
[info] loading model
[info] loaded
hello how
栀攀氀氀漀 栀漀眀`
@bakwc Any ideas why this can happen? Thx in advance!
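One observation on the garbled output above: the characters are exactly what you get when the two bytes of each UTF-16 code unit are swapped (`h` is U+0068, `栀` is U+6800). Whether a byte-order mix-up is what actually happens inside jamspell is an assumption, not something confirmed in this thread. A minimal Python 3 sketch that reproduces the pattern (`swap_utf16_byte_order` is a hypothetical helper name):

```python
def swap_utf16_byte_order(text):
    """Encode as little-endian UTF-16, then decode the same bytes as big-endian."""
    return text.encode("utf-16-le").decode("utf-16-be")


for word in ("hello", "how"):
    print(word, "->", swap_utf16_byte_order(word))
```

For the two words typed in the session, this yields the same characters that the terminal printed, which points at the wide-character conversion step rather than at training or at the model data.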