kanishkamisra/minicons

Chinese word surprisal


Hi,

Transformer models such as "bert-base-multilingual-uncased" can be loaded in minicons to compute token surprisal or probabilities for different languages, given text in that language as input. This works well for English, German, Spanish, and other alphabet-based languages.

However, it doesn't seem to work for Chinese. As you know, Chinese text is usually pre-processed with word segmentation. Even when the input Chinese text is already segmented into words (two-character, three-character, or longer combinations), the output still gives a surprisal for each Chinese character rather than for each segmented word. I am not sure how to compute surprisal values for Chinese words.

Many thanks!

The following is the example:

from minicons import scorer
import torch
from torch.utils.data import DataLoader

import numpy as np

import json
model = scorer.IncrementalLMScorer('bert-base-multilingual-uncased', 'cpu')
Using bos_token, but it is not set yet.
sentences = ["我 昨天 下午 我 就 是 直接 买 了 一份 那个 凉菜", "他们 那边 都 是 小 小葱 包括 重庆 那边"]
model.token_score(sentences, surprisal = True, base_two = True)
[[('[CLS]', 0.0),
  ('我', 16.780792236328125),
  ('昨', 18.67901039123535),
  ('天', 29.759370803833008),
  ('下', 39.109107971191406),
  ('午', 33.43532943725586),
  ('我', 34.247886657714844),
  ('就', 23.704923629760742),
  ('是', 25.778093338012695),
  ('直', 31.338485717773438),
  ('接', 28.79427146911621),
  ('买', 44.60960388183594),
  ('了', 30.6632022857666),
  ('一', 25.942493438720703),
  ('份', 44.91115188598633),
  ('那', 35.40247344970703),
  ('个', 37.76634979248047),
  ('凉', 35.126708984375),
  ('菜', 11.82837963104248),
  ('[SEP]', 32.64777755737305)],
 [('[CLS]', 0.0),
  ('他', 15.437037467956543),
  ('们', 10.030117988586426),
  ('那', 30.752634048461914),
  ('边', 45.248435974121094),
  ('都', 20.54657745361328),
  ('是', 27.90602684020996),
  ('小', 31.462167739868164),
  ('小', 2.2013779016560875e-05),
  ('葱', 17.992713928222656),
  ('包', 13.990900039672852),
  ('括', 34.425636291503906),
  ('重', 31.417207717895508),
  ('庆', 23.46117401123047),
  ('那', 34.11079788208008),
  ('边', 42.18030548095703),
  ('[SEP]', 36.32227325439453)]]

Thanks for using minicons! A couple of comments:

  1. BERT-based models are all Masked LMs, so they should be instantiated using scorer.MaskedLMScorer; when you do that, the surprisal values are more "sane":

code:

model = scorer.MaskedLMScorer('bert-base-multilingual-uncased', 'cpu')

sentences = ["我 昨天 下午 我 就 是 直接 买 了 一份 那个 凉菜", "他们 那边 都 是 小 小葱 包括 重庆 那边"]

model.token_score(sentences, surprisal = True, base_two = True)

output:

[[('我', 3.289663553237915),
  ('昨', 6.386978626251221),
  ('天', 1.0551865100860596),
  ('下', 1.1172817945480347),
  ('午', 5.744592189788818),
  ('我', 4.506011486053467),
  ('就', 3.47978138923645),
  ('是', 1.6619055271148682),
  ('直', 0.08112353086471558),
  ('接', 0.17920592427253723),
  ('买', 5.799918174743652),
  ('了', 1.1132781505584717),
  ('一', 0.5585418343544006),
  ('份', 4.77261209487915),
  ('那', 5.216219902038574),
  ('个', 4.203708648681641),
  ('凉', 14.916685104370117),
  ('菜', 22.32535743713379)],
 [('他', 4.862102031707764),
  ('们', 0.0773811861872673),
  ('那', 2.65114426612854),
  ('边', 3.2553274631500244),
  ('都', 1.2650190591812134),
  ('是', 1.9610095024108887),
  ('小', 6.19584846496582),
  ('小', 7.71571683883667),
  ('葱', 18.485973358154297),
  ('包', 0.14176048338413239),
  ('括', 4.838305473327637),
  ('重', 0.5396705269813538),
  ('庆', 4.927285194396973),
  ('那', 3.8220901489257812),
  ('边', 15.283344268798828)]]
  2. In terms of the segmentation, unfortunately I think that is an issue with the model's tokenizer -- it was likely not created with the peculiarities of certain languages in mind, and therefore incorrectly pre-processes the tokens in your input :(
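
For example, you can confirm that the character-level splitting happens inside the tokenizer itself, independently of minicons, by calling it directly through the transformers library -- a minimal sketch, assuming transformers is installed:

from transformers import AutoTokenizer

# Same tokenizer that the scorer loads under the hood
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")

# Even though the input is pre-segmented into words with spaces,
# mBERT's WordPiece vocabulary splits Chinese text into single characters.
print(tokenizer.tokenize("我 昨天 下午 我 就 是 直接 买 了 一份 那个 凉菜"))
# e.g. ['我', '昨', '天', '下', '午', ...] -- one token per character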

Please let me know if (2) makes sense, and if you have any questions!

Thanks for your response!
However, I am not sure whether this is a problem with the input texts. Even after the Chinese sentence has been segmented into words (split by spaces), it is still processed character by character. The same happens when different separators are inserted between the words.

I also tested a Japanese sentence. Unfortunately, the Japanese sentence is handled in the same way as the Chinese one, even when the words in the sentence are segmented by spaces.

If this word-segmentation problem can be solved, it would greatly improve the usefulness of your software, because many East Asian languages are, like Chinese, written in characters rather than space-separated words.

from minicons import scorer
import torch
from torch.utils.data import DataLoader

import numpy as np

import json
model = scorer.MaskedLMScorer('hfl/chinese-roberta-wwm-ext', 'cpu')
sentences = ["他边, 都, 是, 小, 小葱, 包括, 重庆, 那边"]
model.token_score(sentences, surprisal = True, base_two = True)
[[('他', 10.507850646972656),
  ('边', 9.173558235168457),
  (',', 2.7609708309173584),
  ('都', 7.200051784515381),
  (',', 9.363616943359375),
  ('是', 7.957242488861084),
  (',', 3.961794376373291),
  ('小', 6.596111297607422),
  (',', 4.97226619720459),
  ('小', 4.442281723022461),
  ('葱', 14.425054550170898),
  (',', 0.6053321361541748),
  ('包', 0.5206245183944702),
  ('括', 7.987270355224609),
  (',', 2.232102394104004),
  ('重', 0.7921300530433655),
  ('庆', 5.852537155151367),
  (',', 3.3460614681243896),
  ('那', 4.638041019439697),
  ('边', 3.7905919551849365)]]
sentences = ["他边 都 是 小 小葱 包括 重庆 那边"]
model.token_score(sentences, surprisal = True, base_two = True)
[[('他', 10.160442352294922),
  ('边', 9.153076171875),
  ('都', 0.4341392517089844),
  ('是', 0.589961051940918),
  ('小', 4.123552322387695),
  ('小', 8.044651985168457),
  ('葱', 12.513975143432617),
  ('包', 0.05252625420689583),
  ('括', 8.358875274658203),
  ('重', 0.160121351480484),
  ('庆', 2.044276475906372),
  ('那', 2.782963991165161),
  ('边', 4.05214262008667)]]
sentences = ["他边  都  是  小  小葱  包括  重庆  那边"]
model.token_score(sentences, surprisal = True, base_two = True)
[[('他', 10.160442352294922),
  ('边', 9.153076171875),
  ('都', 0.4341392517089844),
  ('是', 0.589961051940918),
  ('小', 4.123552322387695),
  ('小', 8.044651985168457),
  ('葱', 12.513975143432617),
  ('包', 0.05252625420689583),
  ('括', 8.358875274658203),
  ('重', 0.160121351480484),
  ('庆', 2.044276475906372),
  ('那', 2.782963991165161),
  ('边', 4.05214262008667)]]
sentences = ["他边",  "都",  "是",  "小",  "小葱",  "包括",  "重庆",  "那边"]
model.token_score(sentences, surprisal = True, base_two = True)
[[('他', 9.060617446899414), ('边', 15.51742172241211)],
 [('都', 17.21668815612793)],
 [('是', 14.000909805297852)],
 [('小', 16.088621139526367)],
 [('小', 5.2004923820495605), ('葱', 9.96257209777832)],
 [('包', 1.8796451091766357), ('括', 5.973520278930664)],
 [('重', 4.414781093597412), ('庆', 7.7569804191589355)],
 [('那', 6.560302734375), ('边', 10.817954063415527)]]
sentences = ["他边" "都" "是" "小" "小葱" "包括" "重庆" "那边"]
model.token_score(sentences, surprisal = True, base_two = True)
[[('他', 10.160442352294922),
  ('边', 9.153076171875),
  ('都', 0.4341392517089844),
  ('是', 0.589961051940918),
  ('小', 4.123552322387695),
  ('小', 8.044651985168457),
  ('葱', 12.513975143432617),
  ('包', 0.05252625420689583),
  ('括', 8.358875274658203),
  ('重', 0.160121351480484),
  ('庆', 2.044276475906372),
  ('那', 2.782963991165161),
  ('边', 4.05214262008667)]]
model = scorer.MaskedLMScorer('bert-base-multilingual-uncased', 'cpu')
sentences = ["今日 の お昼 は たくさん 食べて、気分 爽快 でした。"]
model.token_score(sentences, surprisal = True, base_two = True)
[[('今', 3.122051954269409),
  ('日', 1.2851203680038452),
  ('の', 3.3951520919799805),
  ('お', 5.956057548522949),
  ('昼', 7.492699146270752),
  ('は', 3.363112449645996),
  ('た', 12.163028717041016),
  ('##く', 3.6662654876708984),
  ('##さん', 11.690041542053223),
  ('食', 1.381737470626831),
  ('へて', 4.570037364959717),
  ('、', 4.791784763336182),
  ('気', 5.888495922088623),
  ('分', 10.172245025634766),
  ('爽', 17.73659896850586),
  ('快', 5.439576148986816),
  ('て', 5.930989742279053),
  ('##した', 9.311400413513184),
  ('。', 5.668310165405273)]]
sentences = ["今日のお昼はたくさん食べて、気分爽快でした."]
model.token_score(sentences, surprisal = True, base_two = True)
[[('今', 2.0951008796691895),
  ('日', 1.311732292175293),
  ('のお', 6.712447166442871),
  ('昼', 7.230654239654541),
  ('は', 2.305724620819092),
  ('##た', 4.359669208526611),
  ('##く', 4.073729515075684),
  ('##さん', 9.832992553710938),
  ('食', 1.1334619522094727),
  ('へて', 5.839948654174805),
  ('、', 4.790576934814453),
  ('気', 7.9265851974487305),
  ('分', 8.79090404510498),
  ('爽', 14.558496475219727),
  ('快', 4.72240686416626),
  ('て', 5.521145343780518),
  ('##した', 9.805434226989746),
  ('.', 1.6501832008361816)]]

Thanks again for a very detailed demonstration! I totally understand that the outputs are not desirable -- but all minicons does is simply use the tokenizers of the respective models from huggingface, as they were trained and created by the original authors of those models, so unfortunately these errors propagate into minicons :(

I would suggest raising this tokenization issue in the huggingface transformers repo (around which minicons was built): https://github.com/huggingface/transformers
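
In the meantime, one possible workaround -- just a post-processing sketch on top of token_score output, not a built-in minicons feature (the word_surprisals helper below is only illustrative) -- is to greedily re-align the sub-token surprisals to your space-separated words and sum them per word, since a word's surprisal can be approximated by the sum of the surprisals of its pieces:

from minicons import scorer

model = scorer.MaskedLMScorer('bert-base-multilingual-uncased', 'cpu')

def word_surprisals(sentence, token_scores):
    """Greedily align sub-token surprisals to the space-separated words
    of the input and sum them per word. This is an approximation: it
    assumes the concatenated tokens reconstruct each word, so [UNK]
    tokens or case/accent-normalized text may break the alignment."""
    tokens = iter(token_scores)
    results = []
    for word in sentence.split():
        covered, total = "", 0.0
        while len(covered) < len(word):
            tok, surp = next(tokens)
            covered += tok.lstrip("#")  # drop WordPiece '##' continuation markers
            total += surp
        results.append((word, total))
    return results

sentence = "我 昨天 下午 我 就 是 直接 买 了 一份 那个 凉菜"
token_scores = model.token_score([sentence], surprisal=True, base_two=True)[0]
print(word_surprisals(sentence, token_scores))
# e.g. [('我', ...), ('昨天', ...), ('下午', ...)] -- one (word, surprisal) pair per word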