Newest Ichiran with newest data seems to be failing 31 tests
Edict, Kanjidic2, jmdict-data, quicklisp and ichiran pulled from the Net yesterday.
Did full-init.
I had to comment out the additions for entry 2209300 in the errata, because that entire entry has been deleted from JMdict. Then I applied the errata again.
macOS 13.6.1 Intel, Postgres and SBCL installed through Brew.
Results:
Unit Test Summary
| 707 assertions total
| 676 passed
| 31 failed
| 2 execution errors
| 0 missing tests
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("猫" "は" "しっぽ" "を" "ぴんと" "立てて" "歩いた")
| but saw ("猫" "は" "しっぽ" "を" "ぴんと立てて" "歩いた")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("わかりきった") but saw ("わ" "かりきった")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("おとめ" "に" "ふさわしい" "振る舞い") but saw ("お" "とめ" "に" "ふさわしい" "振る舞い")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("折りたたみ" "式" "ついたて") but saw ("折りたたみ" "式" "ついた" "て")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("使い物" "に" "ならん" "だろ") but saw ("使い" "物にならん" "だろ")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("雪" "が" "ない" "ため") but saw ("雪" "が" "な" "いため")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("バラしちゃってる") but saw ("バラ" "しちゃってる")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("何も" "口" "に" "せぬ") but saw ("何も" "口" "にせぬ")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("工夫" "が" "される") but saw ("工夫" "がされる")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("だめ" "だったら") but saw ("だ" "めだったら")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("彼女" "は" "苦しげ" "に" "うめいて" "横たわった")
| but saw ("彼女" "は" "苦しげ" "に" "うめ" "いて" "横たわった")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("共感" "性") but saw ("共感性")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("それ" "ただ" "の" "怪しい" "人" "です" "し")
| but saw ("それた" "だの" "怪しい" "人" "です" "し")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("出したい" "とき" "は") but saw ("出した" "いと" "き" "は")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("旅行" "に" "いきたい") but saw ("旅行" "にい" "きたい")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("しない" "かい") but saw ("し" "ないかい")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("てか" "最近" "ファン" "層" "は" "円盤" "すら" "買わない" "から" "そいつら" "から" "金" "とる"
"ってのは" "無謀")
| but saw ("てか" "最近" "ファン層" "は" "円盤" "すら" "買わない" "から" "そいつら" "から" "金" "とる" "ってのは"
"無謀")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("なんというか" "すみません") but saw ("なんという" "かすみません")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("そう" "したい" "から" "した" "だけ" "だ") but saw ("そうした" "いからした" "だけ" "だ")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("手にとって" "いただき" "やすくなる") but saw ("手にとっていた" "だ" "きやすくなる")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("大事" "に" "なります") but saw ("大" "事になります")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("奴" "が" "まとも" "に" "見られない") but saw ("奴" "が" "まともに" "見られない")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("といった" "ところ" "でしょうか") but saw ("と" "いった" "ところ" "でしょうか")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("言い方" "も" "します") but saw ("言い方" "もします")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("届け" "したら") but saw ("届" "けしたら")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("全く" "と" "いって" "いい") but saw ("全く" "と" "いっていい")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("仲良し" "に" "なったら") but saw ("仲良し" "になったら")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("体" "に" "悪い" "と" "知り" "ながら" "タバコをやめる" "こと" "は" "できない")
| but saw ("体に悪い" "と" "知り" "ながら" "タバコをやめる" "こと" "は" "できない")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("雨" "が" "降りそう" "な" "気がします") but saw ("雨が降りそう" "な" "気がします")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("そういう" "お" "隣" "どうし") but saw ("そういう" "お" "隣どうし")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("みんな" "土足で" "おいで") but saw ("みんな" "土足で" "おい" "で")
|
SEGMENTATION-TEST: 451 assertions passed, 31 failed, and an execution error.
| Execution error:
| Database error 42P01: relation "kanji" does not exist
QUERY: (SELECT r.text, r.type FROM kanji AS k INNER JOIN reading AS r ON (r.kanji_id = k.id) WHERE ((k.text = E'取') and (not (r.type IN (E'ja_na')))))
|
MATCH-READINGS-TEST: 0 assertions passed, 0 failed, and an execution error.
| Execution error:
| Database error 42P01: relation "kanji" does not exist
QUERY: (SELECT r.text, r.type FROM kanji AS k INNER JOIN reading AS r ON (r.kanji_id = k.id) WHERE ((k.text = E'気') and (not (r.type IN (E'ja_na')))))
|
SEGMENTATION-TEST: 451 assertions passed, 31 failed, and an execution error.
#<TEST-RESULTS-DB Total(707) Passed(676) Failed(31) Errors(2)>
Hi, unfortunately, because JMdict data is always changing, it's impossible for the segmentation tests to always pass unless the tests are modified and the code is manually recalibrated. For that reason, only the latest release is guaranteed to actually pass all the tests.
For example
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("猫" "は" "しっぽ" "を" "ぴんと" "立てて" "歩いた")
| but saw ("猫" "は" "しっぽ" "を" "ぴんと立てて" "歩いた")
This test failure is caused by the word ぴんと立つ being added to JMdict database on 2022-07-19. Since the latest release of Ichiran was in January 2022 the test doesn't use this word for segmentation.
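To illustrate why a single new dictionary entry can flip a segmentation test, here is a minimal sketch in Python. It is not Ichiran's algorithm: the `segment` function, its scoring rule (sum of squared word lengths) and the toy lexicon of surface forms are all assumptions made for the example. How Ichiran derives a form like ぴんと立てて from the JMdict entry is not modelled here.

```python
from functools import lru_cache


def segment(text, lexicon):
    """Split text into tokens, maximising the sum of squared lengths of known words.

    Characters not covered by any lexicon word become single-character tokens
    that score 0. This is a toy scoring rule, not the one Ichiran uses.
    """
    max_len = max(map(len, lexicon), default=1)

    @lru_cache(maxsize=None)
    def best(start):
        # Returns (score, tokens) for the best segmentation of text[start:].
        if start == len(text):
            return 0, ()
        # Fallback: treat one character as an unknown token worth 0 points.
        tail_score, tail = best(start + 1)
        candidates = [(tail_score, (text[start],) + tail)]
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            if word in lexicon:
                tail_score, tail = best(end)
                candidates.append((len(word) ** 2 + tail_score, (word,) + tail))
        return max(candidates, key=lambda c: c[0])

    return list(best(0)[1])


sentence = "猫はしっぽをぴんと立てて歩いた"
old_lexicon = {"猫", "は", "しっぽ", "を", "ぴんと", "立てて", "歩いた"}
new_lexicon = old_lexicon | {"ぴんと立てて"}  # hypothetical surface form of the new entry

print(segment(sentence, old_lexicon))  # "ぴんと" and "立てて" stay separate
print(segment(sentence, new_lexicon))  # "ぴんと立てて" is now a single token
```

With the longer word available, the merged token outscores the two shorter ones, so a test written against the older data reports a failure even though nothing in the code changed.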
As for this,
Database error 42P01: relation "kanji" does not exist
check that you have downloaded file kanjidic2.xml and specified a path to it in settings. Try manually running the following functions:
(ichiran/mnt:load-kanjidic)
(ichiran/mnt:load-kanji-stats)
I understand; so in that example the actual result is better than the expected one, given the current state of JMdict, and it's the expected value that needs to be adjusted.
As for kanjidic2.xml, I have it and the path is correct.
- (ichiran/mnt:load-kanjidic)
500 entries loaded
1000 entries loaded
1500 entries loaded
2000 entries loaded
2500 entries loaded
3000 entries loaded
3500 entries loaded
4000 entries loaded
4500 entries loaded
5000 entries loaded
5500 entries loaded
6000 entries loaded
6500 entries loaded
7000 entries loaded
7500 entries loaded
8000 entries loaded
8500 entries loaded
9000 entries loaded
9500 entries loaded
10000 entries loaded
10500 entries loaded
11000 entries loaded
11500 entries loaded
12000 entries loaded
12500 entries loaded
13000 entries loaded
13109 entries total
NIL
- (ichiran/mnt:load-kanji-stats)
100 kanji processed
200 kanji processed
300 kanji processed
400 kanji processed
500 kanji processed
600 kanji processed
700 kanji processed
800 kanji processed
900 kanji processed
1000 kanji processed
1100 kanji processed
1200 kanji processed
1300 kanji processed
1400 kanji processed
1500 kanji processed
1600 kanji processed
1700 kanji processed
1800 kanji processed
1900 kanji processed
2000 kanji processed
2100 kanji processed
2136 kanji total
NIL
I did that just now, but it should also have been executed earlier as part of full-init, so I have to assume these were already loaded and calculated when I ran the tests previously. I can't rerun the tests at the moment to confirm whether the error is still there, though: in the meantime I added some logging to better understand how it works, and as a side effect the tests seem to lock up midway. I think it's possible that some other change to JMdict or KANJIDIC caused the earlier error.
I repeated the procedure on a fresh database, and the 'kanji' error didn't show up. So indeed, most likely the kanjidic2 database hadn't been loaded despite full-init having finished execution, and the kanjidic2 path being already provided to it before it started.
A mystery, but apparently no longer reproducible.
It's still failing the same 31 tests, but it's expected. Closing.
I think the first time it failed on add-errata because the word in question was deleted from JMdict (due to my comment in fact...), I'll try to make it work with the latest data in the coming weeks.
debugger invoked on a CL-POSTGRES-ERROR:FOREIGN-KEY-VIOLATION in thread
#<THREAD "main thread" RUNNING {1001870103}>:
Database error 23503: insert or update on table "kana_text" violates foreign key constraint "kana_text_entry_seq_foreign"
DETAIL: Key (seq)=(2209300) is not present in table "entry".
QUERY: INSERT INTO kana_text (best_kanji, nokanji, conjugate_p, common_tags, common, ord, text, seq) VALUES (NULL, false, true, E'', NULL, 0, E'たへる', 2209300) RETURNING id
Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.
restarts (invokable by number or by possibly-abbreviated name):
0: [ABORT] Exit debugger, returning to top level.
(CL-POSTGRES::GET-ERROR #<SB-SYS:FD-STREAM for "socket 127.0.0.1:54562, peer: 127.0.0.1:5432" {1001B65323}>)
source: (ERROR (CL-POSTGRES-ERROR::GET-ERROR-TYPE CODE) :CODE CODE :MESSAGE
(GET-FIELD #\M) :DETAIL (GET-FIELD #\D) :HINT (GET-FIELD #\H)
:CONTEXT (GET-FIELD #\W) ...)
I reinitialized the entire database, and indeed, it turned out that there had been lingering side-effects of that crash (notably n-kanji and n-kana in many conjugations were left at 0, which wasn't causing crashing, but was causing trouble with scoring).
After the reinitialisation, it only fails on 13 tests:
Unit Test Summary
| 748 assertions total
| 735 passed
| 13 failed
| 0 execution errors
| 0 missing tests
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("だしといて") but saw ("だし" "といて")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("猫" "は" "しっぽ" "を" "ぴんと" "立てて" "歩いた")
| but saw ("猫" "は" "しっぽ" "を" "ぴんと立てて" "歩いた")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("おとめ" "に" "ふさわしい" "振る舞い") but saw ("お" "とめ" "に" "ふさわしい" "振る舞い")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("バラしちゃってる") but saw ("バラ" "しちゃってる")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("ガス" "が" "ついている") but saw ("ガス" "が" "ついて" "いる")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("工夫" "が" "される") but saw ("工夫" "がされる")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("共感" "性") but saw ("共感性")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("しない" "かい") but saw ("し" "ないかい")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("てか" "最近" "ファン" "層" "は" "円盤" "すら" "買わない" "から" "そいつら" "から" "金" "とる"
"ってのは" "無謀")
| but saw ("てか" "最近" "ファン層" "は" "円盤" "すら" "買わない" "から" "そいつら" "から" "金" "とる" "ってのは"
"無謀")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("奴" "が" "まとも" "に" "見られない") but saw ("奴" "が" "まともに" "見られない")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("体" "に" "悪い" "と" "知り" "ながら" "タバコをやめる" "こと" "は" "できない")
| but saw ("体に悪い" "と" "知り" "ながら" "タバコをやめる" "こと" "は" "できない")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("雨" "が" "降りそう" "な" "気がします") but saw ("雨が降りそう" "な" "気がします")
|
| Failed Form: ICHIRAN/TEST::RESULT
| Expected ("そういう" "お" "隣" "どうし") but saw ("そういう" "お" "隣どうし")
|
SEGMENTATION-TEST: 497 assertions passed, 13 failed.
#<TEST-RESULTS-DB Total(748) Passed(735) Failed(13) Errors(0)>
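When triaging output like the above, it can help to sort failures by how the token boundaries differ from the expected ones. Below is a small Python sketch; the `classify` helper and its category names are made up for this example and are not part of Ichiran's test suite. A "merged" result means the segmenter produced a coarser split than expected, which is what a newly added dictionary word would cause.

```python
def boundaries(tokens):
    """Return the character offsets at which a token ends, excluding the end of the text."""
    cuts, pos = set(), 0
    for tok in tokens[:-1]:
        pos += len(tok)
        cuts.add(pos)
    return cuts


def classify(expected, saw):
    """Compare two segmentations of the same text by their boundary sets."""
    if "".join(expected) != "".join(saw):
        raise ValueError("token lists cover different text")
    e, s = boundaries(expected), boundaries(saw)
    if s == e:
        return "same"
    if s < e:   # saw has fewer boundaries: tokens were joined
        return "merged"
    if s > e:   # saw has extra boundaries: tokens were broken up
        return "split"
    return "shifted"  # boundaries moved rather than just added or removed


print(classify(["共感", "性"], ["共感性"]))                 # merged
print(classify(["バラしちゃってる"], ["バラ", "しちゃってる"]))  # split
print(classify(["しない", "かい"], ["し", "ないかい"]))       # shifted
```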
dec23 branch contains code which should pass all tests on recent JMdict dumps (make sure to run (add-errata) after updating). I'll make a new release soon unless there are some terrible issues with it (haven't tested this version much yet).
I just got around to doing it, and full-init seems to be failing very early on:
* (ichiran/maintenance:full-init)
Initializing ichiran/dict...
debugger invoked on a CL-POSTGRES-ERROR:UNIQUE-VIOLATION in thread
#<THREAD "main thread" RUNNING {10010C0093}>:
Database error 23505: duplicate key value violates unique constraint "entry_pkey"
DETAIL: Key (seq)=(1000280) already exists.
QUERY: INSERT INTO entry (primary_nokanji, n_kana, n_kanji, root_p, content, seq) VALUES (false, 0, 0, true, E'<?xml version="1.0" encoding="UTF-8"?>
<entry>
<ent_seq>1000280</ent_seq>
<k_ele>
<keb>論う</keb>
</k_ele>
<r_ele>
<reb>あげつらう</reb>
</r_ele>
<sense>
<pos>v5u</pos>
<pos>vt</pos>
<misc>uk</misc>
<gloss xml:lang="eng">to discuss</gloss>
</sense>
<sense>
<pos>v5u</pos>
<pos>vt</pos>
<gloss xml:lang="eng">to find fault with</gloss>
<gloss xml:lang="eng">to criticize</gloss>
<gloss xml:lang="eng">to criticise</gloss>
</sense>
</entry>', 1000280)
Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.
restarts (invokable by number or by possibly-abbreviated name):
0: [ABORT] Exit debugger, returning to top level.
(CL-POSTGRES::GET-ERROR #<SB-SYS:FD-STREAM for "socket 127.0.0.1:55237, peer: 127.0.0.1:5432" {100EF410F3}>)
source: (ERROR (CL-POSTGRES-ERROR::GET-ERROR-TYPE CODE) :CODE CODE :MESSAGE
(GET-FIELD #\M) :DETAIL (GET-FIELD #\D) :HINT (GET-FIELD #\H)
:CONTEXT (GET-FIELD #\W) ...)
0] 0
*
Previous master worked correctly with the same JMDict file from around the middle of December, so I think some code change must have caused this...
To be sure, I downloaded today's JMdict_e, the newest one, and tried with that, but it didn't fix anything: same crash.
Very strange, it's supposed to be dropping the tables at the beginning of full-init, and it seems impossible for the xml file to have a duplicated entry...
Maybe I should have tried just add-errata first, but I wanted to be sure it's all reset. Now I also can't try add-errata anymore, since full-init deleted the tables.
Actually, never mind that. This is related to a change I made to load-entry to auto-conjugate words from data/extra.xml
EDIT: just pushed a fix to the branch
Your last fix seems to have fixed that one. full-init now gets as far as "Loading custom data..." before crashing:
Loading custom data...
debugger invoked on a CXML:WELL-FORMEDNESS-VIOLATION in thread
#<THREAD "main thread" RUNNING {10010E8093}>:
Document not well-formed: Bad attribute value delimiter #\\, must be either #\" or #\'.
Location:
Line 44, column 24 in NIL
Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.
restarts (invokable by number or by possibly-abbreviated name):
0: [ABORT] Exit debugger, returning to top level.
(CXML::%ERROR CXML:WELL-FORMEDNESS-VIOLATION #<RUNES:XSTREAM [main document :MAIN NIL]> "Document not well-formed: Bad attribute value delimiter #\\\\, must be either #\\\" or #\\'.")
source: (ERROR CLASS :FORMAT-CONTROL "~A" :FORMAT-ARGUMENTS
(LIST (GET-OUTPUT-STREAM-STRING S)))
0]
EDIT: I am going to assume the problem is that the quotes around "eng" in the last two seqs in extra.xml are backslash-escaped, unlike the "eng" in the older content there, so I'll edit that and restart full-init.
Yeah, the XML file was corrupted. I fixed it and added a test for it.
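For reference, the failure mode is easy to reproduce outside Ichiran. The sketch below uses Python's standard `xml.etree.ElementTree` parser rather than CXML, and the `is_well_formed` helper is a hypothetical name. It assumes the corrupted lines looked like a `gloss` element whose attribute quotes were backslash-escaped, which matches the "Bad attribute value delimiter" message above but is not confirmed by the thread.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False on a well-formedness error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


good = '<gloss xml:lang="eng">to discuss</gloss>'
# Raw string: the backslashes are literally part of the text, as in the corrupted file.
bad = r'<gloss xml:lang=\"eng\">to discuss</gloss>'

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False: a backslash is not a valid attribute delimiter
```

Running a check like this over data files before loading them would turn a crash in the middle of full-init into an early, readable failure.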