ipadic problem for 四半期連結会計期間末日満期手形
Thank you for releasing bert-small-japanese-fin and your other ELECTRA models for FinTech. However, I've found that they tokenize "四半期連結会計期間末日満期手形" badly:
>>> from transformers import AutoTokenizer
>>> tokenizer=AutoTokenizer.from_pretrained("izumi-lab/bert-small-japanese-fin")
>>> tokenizer.tokenize("四半期連結会計期間末日満期手形")
['四半期', '連結', '会計', '期間', '末日', '満期', '手形']
>>> tokenizer.tokenize("第3四半期連結会計期間末日満期手形")
['第', '3', '四半期連結会計期間末日満期手形']
This is caused by a bug in ipadic's 名詞,数 (numeral noun) tokenization of kanji strings that begin with a kanji numeral (漢数字).
>>> import fugashi,ipadic
>>> parser=fugashi.GenericTagger(ipadic.MECAB_ARGS).parse
>>> print(parser("四半期連結会計期間末日満期手形"))
四半期 名詞,一般,*,*,*,*,四半期,シハンキ,シハンキ
連結 名詞,サ変接続,*,*,*,*,連結,レンケツ,レンケツ
会計 名詞,サ変接続,*,*,*,*,会計,カイケイ,カイケイ
期間 名詞,一般,*,*,*,*,期間,キカン,キカン
末日 名詞,一般,*,*,*,*,末日,マツジツ,マツジツ
満期 名詞,一般,*,*,*,*,満期,マンキ,マンキ
手形 名詞,一般,*,*,*,*,手形,テガタ,テガタ
EOS
>>> print(parser("第3四半期連結会計期間末日満期手形"))
第 接頭詞,数接続,*,*,*,*,第,ダイ,ダイ
3 名詞,数,*,*,*,*,*
四半期連結会計期間末日満期手形 名詞,数,*,*,*,*,*
EOS
I recommend using a tokenizer other than BertJapaneseTokenizer + ipadic. See details in my diary (written in Japanese).
Thank you for your comment and for sharing the issue. I had not noticed this ipadic problem. Not only tokenization but also vocab.txt (the vocabulary-building process) would be affected: the vocabulary wrongly contains such a long word, which should instead be split into shorter words such as '四半期', '連結', '会計', '期間', '末日', '満期', and '手形'.
Is this problem unique to ipadic? If so, one solution would be to change the dictionary from ipadic to unidic_lite or unidic, and we would need to pre-train our model with the new dictionary again.
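For reference, a sketch of what that dictionary swap could look like at load time, assuming the mecab_kwargs / mecab_dic options of BertJapaneseTokenizer and an installed unidic_lite package (the vocabulary and the pre-trained weights would still need to be rebuilt):
from transformers import BertJapaneseTokenizer
# Keep the released vocab but switch MeCab from ipadic to unidic_lite
# (assumption: this transformers version supports mecab_kwargs/mecab_dic);
# this only changes word segmentation, not vocab.txt itself
tokenizer=BertJapaneseTokenizer.from_pretrained("izumi-lab/bert-small-japanese-fin",word_tokenizer_type="mecab",mecab_kwargs={"mecab_dic":"unidic_lite"})
print(tokenizer.tokenize("第3四半期連結会計期間末日満期手形"))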
Is this problem unique to ipadic?
Maybe. At least unidic_lite does not tokenize them that way:
>>> import fugashi,unidic_lite
>>> parser=fugashi.GenericTagger("-d "+unidic_lite.DICDIR).parse
>>> print(parser("四半期連結会計期間末日満期手形"))
四半 シハン シハン 四半 名詞-普通名詞-一般 0,2
期 キ キ 期 名詞-普通名詞-助数詞可能 1
連結 レンケツ レンケツ 連結 名詞-普通名詞-サ変可能 0
会計 カイケー カイケイ 会計 名詞-普通名詞-サ変可能 0
期間 キカン キカン 期間 名詞-普通名詞-一般 1,2
末日 マツジツ マツジツ 末日 名詞-普通名詞-一般 0
満期 マンキ マンキ 満期 名詞-普通名詞-一般 0,1
手形 テガタ テガタ 手形 名詞-普通名詞-一般 0
EOS
>>> print(parser("第3四半期連結会計期間末日満期手形"))
第 ダイ ダイ 第 接頭辞
3 3 3 3 名詞-数詞 0
四半 シハン シハン 四半 名詞-普通名詞-一般 0,2
期 キ キ 期 名詞-普通名詞-助数詞可能 1
連結 レンケツ レンケツ 連結 名詞-普通名詞-サ変可能 0
会計 カイケー カイケイ 会計 名詞-普通名詞-サ変可能 0
期間 キカン キカン 期間 名詞-普通名詞-一般 1,2
末日 マツジツ マツジツ 末日 名詞-普通名詞-一般 0
満期 マンキ マンキ 満期 名詞-普通名詞-一般 0,1
手形 テガタ テガタ 手形 名詞-普通名詞-一般 0
EOS
However, unidic_lite (or unidic) is based upon 国語研短単位 (NINJAL Short Unit Words), which is a rather short word unit for this purpose. I think a longer unit, such as 国語研長単位 (NINJAL Long Unit Words), is more suitable for FinTech. Would you try to make your own tokenizer?
As you mentioned, it seems that subword tokenization based on long units such as 長単位 would be better than using ipadic or unidic(_lite).
I think it would be better to create such a tokenizer, but that is difficult with my current resources...
Hi @retarfi, I've just released Japanese-LUW-Tokenizer. It took about 20 hours to build the tokenizer from a 700MB orig.txt (one UTF-8 sentence per line) on a single GPU (NVIDIA GeForce RTX 2080):
import unicodedata
from tokenizers import CharBPETokenizer
from transformers import AutoModelForTokenClassification,AutoTokenizer,TokenClassificationPipeline,RemBertTokenizerFast

# Long-unit-word (長単位) segmenter built on bert-base-japanese-luw-upos
brt="KoichiYasuoka/bert-base-japanese-luw-upos"
mdl=AutoModelForTokenClassification.from_pretrained(brt)
tkz=AutoTokenizer.from_pretrained(brt)
nlp=TokenClassificationPipeline(model=mdl,tokenizer=tkz,aggregation_strategy="simple",device=0)

# Segment orig.txt into long unit words, 256 sentences per batch,
# writing one space-separated sentence per line to luw.txt
with open("orig.txt","r",encoding="utf-8") as f, open("luw.txt","w",encoding="utf-8") as w:
  d=[]
  for r in f:
    if r.strip()!="":
      d.append(r.strip())
      if len(d)>255:
        for s in nlp(d):
          print(" ".join(t["word"] for t in s),file=w)
        d=[]
  if len(d)>0:
    for s in nlp(d):
      print(" ".join(t["word"] for t in s),file=w)

# Single CJK characters from the original vocabulary, reused as the initial alphabet
alp=[c for c in tkz.convert_ids_to_tokens([i for i in range(len(tkz))]) if len(c)==1 and unicodedata.name(c).startswith("CJK")]
pst=tkz.backend_tokenizer.post_processor

# Train a character-level BPE tokenizer on the LUW-segmented corpus
tkz=CharBPETokenizer(lowercase=False,unk_token="[UNK]",suffix="")
tkz.normalizer.handle_chinese_chars=False
tkz.post_processor=pst
tkz.train(files=["luw.txt"],vocab_size=250300,min_frequency=2,limit_alphabet=20000,initial_alphabet=alp,special_tokens=["[PAD]","[UNK]","[CLS]","[SEP]","[MASK]","<special0>","<special1>","<special2>","<special3>","<special4>","<special5>","<special6>","<special7>","<special8>","<special9>"],suffix="")
tkz.save("tokenizer.json")

# Wrap the trained tokenizer as a fast Hugging Face tokenizer and save it
tokenizer=RemBertTokenizerFast(tokenizer_file="tokenizer.json",vocab_file="/dev/null",bos_token="[CLS]",cls_token="[CLS]",unk_token="[UNK]",pad_token="[PAD]",mask_token="[MASK]",sep_token="[SEP]",do_lower_case=False,keep_accents=True)
tokenizer.save_pretrained("Japanese-LUW-Tokenizer")
vocab_size=250300 seems too big, but it is acceptable. See details in my diary (written in Japanese).
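To try the result, a minimal sketch that loads the directory saved above (the exact subword split depends on the trained vocabulary, so no output is claimed here):
from transformers import AutoTokenizer
# Load the tokenizer written by save_pretrained("Japanese-LUW-Tokenizer")
tokenizer=AutoTokenizer.from_pretrained("Japanese-LUW-Tokenizer")
print(tokenizer.tokenize("四半期連結会計期間末日満期手形"))
print(tokenizer.tokenize("第3四半期連結会計期間末日満期手形"))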
Thank you for sharing! I will check it in detail.