arunsupe/semantic-grep

Not working with Chinese model cc.zh.300.bin (from cc.zh.300.vec.gz) on Windows 10

Closed this issue · 6 comments

First of all, thank you so much for the multi-language support.

I followed the instructions, doing four things. No errors were reported, but it did not work with Chinese.

I prepared a 1.txt file with UTF-8 encoding and made sure that a grep command finds the keyword in the file.

Interestingly, the Chinese model works for the English test.

  1. Build fasttext-to-bin.go
❯ go build fasttext-to-bin.go
  2. Conversion to cc.zh.300.bin
❯ gunzip -c cc.zh.300.vec.gz | ./fasttext-to-bin -input - -output ./cc.zh.300.bin
Conversion complete. Word2Vec binary model saved as ./cc.zh.300.bin

  3. Change the config.json
Administrator in CS\cml\semantic-grep via 🐹 v1.19.1 took 53s
❯ cat "G:\CS\cml\semantic-grep\config.json"
{
    "model_path": "G:/CS/cml/semantic-grep/cc.zh.300.bin"
}

  4. Tests

Administrator in CS\cml\semantic-grep via 🐹 v1.19.1
❯ ./sgrep  -C 2 -n -threshold 0.55 '合理性' 1.txt
Using configuration file: G:\CS\cml\semantic-grep\config.json

❯ grep '合理性' 1.txt
"更多的证据 "的一种可能性是对合理演绎推断的保真属性的一种类推:正如在真实的前提下进行合理演绎本身也可以确保为真一样,从真实的观察进行的合理归纳也应该是真实的,至少在证据不断增加的限制下应该如此。然而,这只是对我们的推断程序是具有连续性的基本要求。如上文所述,使用贝叶斯法则并不是确保一致性的充分条件,也不是必要条件。事实上,我们所知道的每一个关于贝叶斯一致性的证明,要么是假设对同一问题有一个具有一致性的非贝叶斯程序,要么是做了其他的假设而这些假设中包含了这样一个具有一致性的非贝叶斯程序的存在。在任何情况下,建立了统计程序一致性的定理都会确保这些程序的演绎合理性
  5. Test the Chinese model by sgrep-ing the word "glory" in hm.txt (The Old Man and the Sea)
❯ ./sgrep  -C 2 -n -threshold 0.55 glory hm.txt
Using configuration file: G:\CS\cml\semantic-grep\config.json
Similarity: 1.0000
1463:not know he was so big."
1464:
1465:"I'll kill him though," he said.  "In all his greatness and his glory."
1466:
1467:Although it is unjust, he thought.  But I will show him what a man can
--

Thanks for trying out sgrep/w2vgrep.

For your issue, it sounds like the program is using the wrong model. Could you please try the following:

  1. Clone the latest repo (I fixed a few issues today)
  2. Double-check that the model is being built correctly by fasttext-to-bin. The md5sum I am getting for the processed model (md5sum cc.zh.300.bin) is 67af1742fa5c1c0fe20cf68aa4447cfb
  3. Try running the program with the model path in the command-line:
    curl -s https://www.gutenberg.org/cache/epub/25328/pg25328.txt | w2vgrep.windows.amd64.exe --model_path=cc.zh.300.bin 燒

Please let me know if this fixes the issue. If not, I will have to think harder.

Thanks!
I updated to version 0.6 and made sure that the md5sum is right:

Administrator in CS\cml\semantic-grep via 🐹 v1.19.1
❯ md5sum cc.zh.300.bin
67af1742fa5c1c0fe20cf68aa4447cfb *cc.zh.300.bin

The sample command in your instructions worked:
curl -s https://www.gutenberg.org/cache/epub/25328/pg25328.txt | w2vgrep.windows.amd64.exe --model_path=cc.zh.300.bin 燒

However, it only shows results with Similarity: 1.0000. I have tried other Chinese characters and the result is always the same. So is the model cc.zh.300.bin not doing its job?

I also tried to use w2vgrep on local files without curl, but my commands did not work. I expected it would at least show what a grep command shows, as the target word 合理性 is indeed in 1.txt.

Administrator in CS\cml\semantic-grep via 🐹 v1.19.1 took 52s
❯ ./w2vgrep.windows.amd64.exe --model_path=cc.zh.300.bin 合理性 1.txt
Using configuration file: G:\CS\cml\semantic-grep\config.json

Administrator in CS\cml\semantic-grep via 🐹 v1.19.1 took 55s
❯ cat 1.txt |./w2vgrep.windows.amd64.exe --model_path=cc.zh.300.bin 合理性
Using configuration file: G:\CS\cml\semantic-grep\config.json

[1.txt](https://github.com/user-attachments/files/16470703/1.txt)

The default threshold is 0.7, which may be high for your use case. Try lowering it:

curl -s https://www.gutenberg.org/cache/epub/25328/pg25328.txt | w2vgrep.windows.amd64.exe --model_path=cc.zh.300.bin --threshold=0.3 燒

This is finding me 見 (0.4776), 頭 (0.4200), and 鶴 (0.3543). I do not know the language, so I cannot say whether these are good matches.

To help troubleshoot the model, I added a synonym-finder.go to ./model_processing_utils/. This program finds the words in the model that are similar to the query word, above a given threshold.
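
Conceptually, it is just a scan over the model's vocabulary. Here is a minimal sketch of the idea, not the actual source (it assumes the model is already in memory as a map[string][]float32 of unit-normalized vectors, so cosine similarity reduces to a dot product; the function names and the toy model are hypothetical):

```go
package main

import "fmt"

// With unit-normalized vectors, cosine similarity is just the dot product.
func dot(a, b []float32) float32 {
	var s float32
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// findSynonyms scans every word in the model and prints those whose
// similarity to the query word meets the threshold.
func findSynonyms(model map[string][]float32, query string, threshold float32) {
	qv, ok := model[query]
	if !ok {
		fmt.Printf("%q is not in the model's vocabulary\n", query)
		return
	}
	for word, vec := range model {
		if sim := dot(qv, vec); sim >= threshold {
			fmt.Printf("%s %.4f\n", word, sim)
		}
	}
}

func main() {
	// Toy stand-in for the real 2,000,000-word, 300-dimension model.
	model := map[string][]float32{
		"合理性": {1, 0, 0},
		"妥当性": {0.6, 0.8, 0},
		"苹果":  {0, 0, 1},
	}
	findSynonyms(model, "合理性", 0.5)
}
```

The real thing is built and run like this: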

# build
cd model_processing_utils
go build synonym-finder.go

# run
synonym-finder -model_path path/to/cc.zh.300.bin -threshold 0.5 合理性

# Output:
Words similar to '合理性' with similarity >= 0.50:
妥当性 0.5745
周延性 0.5535
客观性 0.5030
可操作性 0.5053
合理性 1.0000
一致性 0.5334
完善性 0.5656
公正性 0.5245
效用性 0.5316
证立 0.5147
自洽性 0.5008
正当性 0.6018
必要性 0.6499
公允性 0.6152
可行性 0.5923
不合理性 0.6094
效益性 0.5529
合法性 0.6219
应然性 0.5709
不合理 0.5412
正當性與 0.5173
正确性 0.5537
合理 0.5808
可接受性 0.5151
科学性 0.6304
论证 0.5379
实证性 0.5216
有效性 0.6374
公平性 0.5250
周密性 0.5292
充分性 0.5156
吻合性 0.5006
恰当性 0.5426
必然性 0.5574
适度性 0.5401
相似性 0.5101
完备性 0.5060

If these don't solve your issue, please get back to me.

Hi arunsupe, thanks for developing synonym-finder!
As shown in your reply, we can find quite a few synonyms of 合理性 that are above the 0.5 threshold. Showing just the top 4 of the results:

Administrator in CS\cml\semantic-grep via 🐹 v1.19.1
❯ ./synonym-finder.exe --model_path=cc.zh.300.bin -threshold 0.5 合理性
Words similar to '合理性' with similarity >= 0.50:
周延性 0.5535
完善性 0.5656
周密性 0.5292
合理 0.5808

Yet querying 合理性 against 1.txt with a threshold of 0.5 hits no matches. I tried lowering it to 0.3; that did show a lot of results, though the quality is terrible (basically equivalent to grep 性 1.txt).

Administrator in CS\cml\semantic-grep via 🐹 v1.19.1 took 52s
❯ ./w2vgrep.windows.amd64.exe --model_path=cc.zh.300.bin  --threshold=0.5 '合理性' 1.txt
Using configuration file: G:\CS\cml\semantic-grep\config.json

But I have found a workaround by passing the results of synonym-finder to grep:
I use awk to feed each synonym to grep against 1.txt, and if grep finds anything, echo the corresponding similarity.
As the results are long, I show just one match here. I would say the quality of the matches is quite good.

Administrator in CS\cml\semantic-grep via 🐹 v1.19.1 took 8s
❯ ./synonym-finder.exe --model_path=cc.zh.300.bin -threshold 0.5 合理性  | awk 'NR > 1 {print $1, $2}' | while read -r word similarity; do
    if grep --color=auto -q "$word" 1.txt; then
        echo "\"$word\" Similarity: $similarity"
        grep --color=auto "$word" 1.txt
        echo ""
    fi
done
"可操作性" Similarity: 0.5053
由于我们把先验分布看作是贝叶斯模型的一个可检验的部分,我们不需要按照Jaynes的理论为每种情况设计一个独特的、客观正确的先验分布——关于这种做法的记录并不令人振奋(Kass & Wasserman, 1996),刚不用说很多作者对Jaynes这一具体观点持怀疑态度(Seidenfeld, 1979, 1987; Csisz´ar, 1995; Uffink, 1995, 1996)。简而言之,对于贝叶斯主义者来说,"模型 "是先验分布和似然的组合,其中每一个都代表了科学知识、数学上的便利和计算上的可操作性之间的某种妥协。
贝叶斯非参数化模型中的不确定性表示方式是一个技术角度但又非常重要的问题。在有限维的问题中,使用后验分布来表示不确定性在一定程度上得到了Bernstein-von Mises现象的支持,其确保了对于大样本而言,可信区域也是置信区域。在无限维情况下这一点完全失效(Cox,1993;Freedman,1999),因此继续天真地使用后验分布是不明智的。(由于我们把先验分布和后验分布视为正则化工具,这对我们来说并不特别麻烦)与此相关的是,贝叶斯非参数模型中的先验分布是一个随机过程,总是基于可操作性而选择(Ghosh & Ramamoorthi, 2003; Hjort et al., 2010),因此放弃了任何试图代表实际询问者信念的伪装。

It would be great if w2vgrep worked by itself.
Also, synonym-finder takes a while to find the results. Maybe my 11-year-old Windows machine is just too slow.

./synonym-finder.exe --model_path=cc.zh.300.bin -threshold 0.5 合理性  0.00s user 0.01s system 0% cpu 54.693 total

Odd. synonym-finder uses the same logic functions as w2vgrep. I just deleted unused functions from w2vgrep and am looping through the model's words rather than the input text. (The model is basically a dictionary mapping word -> word vector, where each word vector is 300 32-bit floats.)
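
To give a feel for where the time goes: loading is essentially a single pass that decodes the binary file into that dictionary. A rough sketch, assuming the conventional word2vec binary layout (an ASCII header with the counts, then each word followed by its raw little-endian float32s); this is my illustration, not w2vgrep's actual loader:

```go
package main

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"io"
	"math"
	"os"
	"strings"
)

// loadModel reads a word2vec-style binary model into a word -> vector map.
// Assumed layout: "vocabSize dim\n", then per entry: the word, a space,
// dim little-endian float32s, and (in most writers) a trailing newline,
// which TrimSpace absorbs below.
func loadModel(path string) (map[string][]float32, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	r := bufio.NewReader(f)

	header, err := r.ReadString('\n')
	if err != nil {
		return nil, err
	}
	var vocabSize, dim int
	if _, err := fmt.Sscanf(header, "%d %d", &vocabSize, &dim); err != nil {
		return nil, err
	}

	model := make(map[string][]float32, vocabSize)
	raw := make([]byte, 4*dim)
	for i := 0; i < vocabSize; i++ {
		word, err := r.ReadString(' ')
		if err != nil {
			return nil, err
		}
		if _, err := io.ReadFull(r, raw); err != nil {
			return nil, err
		}
		vec := make([]float32, dim)
		for j := range vec {
			vec[j] = math.Float32frombits(binary.LittleEndian.Uint32(raw[4*j:]))
		}
		model[strings.TrimSpace(word)] = vec
	}
	return model, nil
}

func main() {
	model, err := loadModel("cc.zh.300.bin")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("loaded %d words\n", len(model))
}
```

With 2,000,000 words of 300 float32s each, that map holds roughly 2.4GB of vector data before any searching starts, which is why the fixed startup cost dominates short queries.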

The performance bottlenecks are: 1. loading the 2GB model file into memory, and 2. multiplying 300 floating-point numbers for each word comparison. Possible optimizations:

  1. Decrease the number of words in the model (FB's models have 2,000,000 words; a smaller number may do, reducing it to just the words people care about).
  2. Change the model vectors from 300 x 32-bit to 300 x 8-bit, i.e. use 8-bit ints instead of 32-bit floats. Model size will shrink to 25% of the original, but accuracy will decrease.

I am thinking of implementing the 32-bit to 8-bit conversion for the next iteration.
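
One simple way to do that conversion is scalar quantization: store each vector as int8 values plus one float32 scale, and accumulate dot products in integer arithmetic. A minimal sketch of the idea (my illustration, not the planned implementation):

```go
package main

import "fmt"

// quantize maps a float32 vector to int8 plus a per-vector scale, so that
// v[i] ≈ float32(q[i]) * scale. Storage drops from 4 bytes to 1 byte per
// dimension (plus one float32 per word), roughly 25% of the original size.
func quantize(v []float32) (q []int8, scale float32) {
	var maxAbs float32
	for _, x := range v {
		if x < 0 {
			x = -x
		}
		if x > maxAbs {
			maxAbs = x
		}
	}
	q = make([]int8, len(v))
	if maxAbs == 0 {
		return q, 0
	}
	scale = maxAbs / 127
	for i, x := range v {
		q[i] = int8(x / scale) // truncation here is where accuracy is lost
	}
	return q, scale
}

// dot8 approximates the dot product of two quantized vectors:
// accumulate in int32, then apply both scales once at the end.
func dot8(a []int8, sa float32, b []int8, sb float32) float32 {
	var s int32
	for i := range a {
		s += int32(a[i]) * int32(b[i])
	}
	return float32(s) * sa * sb
}

func main() {
	a := []float32{0.1, -0.5, 0.25}
	b := []float32{0.2, 0.4, -0.1}
	qa, sa := quantize(a)
	qb, sb := quantize(b)
	fmt.Printf("exact:  %.4f\n", a[0]*b[0]+a[1]*b[1]+a[2]*b[2]) // -0.2050
	fmt.Printf("approx: %.4f\n", dot8(qa, sa, qb, sb))          // close to exact
}
```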

Thanks again for giving me feedback. I am going to close this issue. Keep an eye out for the 8-bit models; they will help your performance.

Thank you and I look forward to the next iteration.