How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese
This is the official implementation of the paper "How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese". To reproduce our results, please follow the instructions below.
- Python >= 3.9
- PyTorch 1.8.1
- Transformers 4.24.0.dev0
pip install torch==1.8.1+cu101 torchvision==0.9.1+cu101 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install -e .
cd ..
pip install -r requirements.txt
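As a quick sanity check (assuming the installation above completed without errors), you can confirm the package versions from Python:

```python
# Sanity check of the environment; versions should match the requirements above.
import torch
import transformers

print(torch.__version__)          # expected: 1.8.1+cu101
print(transformers.__version__)   # expected: 4.24.0.dev0
print(torch.cuda.is_available())  # True if the CUDA build can see a GPU
```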
Here, we install the required packages under ${HOME}/usr, but you can choose your preferred location by modifying --prefix.
- Model

git clone https://github.com/taku910/mecab.git
cd mecab/mecab
./configure --prefix=${HOME}/usr --with-charset=UTF8
make
make install
cd ../..
- Dictionary

wget "https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7MWVlSDBCSXZMTXM" -O mecab-ipadic-2.7.0-20070801.tar.gz
tar xvzf mecab-ipadic-2.7.0-20070801.tar.gz
cd mecab-ipadic-2.7.0-20070801
./configure --with-mecab-config=$HOME/usr/bin/mecab-config --with-charset=UTF8 --prefix=$HOME/usr
make
make install
cd ..
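To verify the MeCab build and dictionary from Python, a minimal check is shown below (a sketch: it assumes the MeCab Python bindings used later in this README are installed via requirements.txt, and that the dictionary was installed under the $HOME/usr prefix chosen above).

```python
# Minimal MeCab check (sketch; the dictionary path assumes --prefix=$HOME/usr as above).
import os
from MeCab import Tagger

ipadic_path = os.path.expanduser("~/usr/lib/mecab/dic/ipadic")  # assumed install location
tagger = Tagger(f"-Owakati -d {ipadic_path}")
print(tagger.parse("日本語の文を分かち書きします。"))
```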
wget "https://github.com/ku-nlp/jumanpp/releases/download/v2.0.0-rc3/jumanpp-2.0.0-rc3.tar.xz"
tar xvJf jumanpp-2.0.0-rc3.tar.xz
cd jumanpp-2.0.0-rc3
mkdir build && cd build
curl -LO https://github.com/catchorg/Catch2/releases/download/v2.13.8/catch.hpp
mv catch.hpp ../libs/
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$HOME/usr
make
make install
echo 'export PATH=$PATH:$HOME/usr' >> ~/.bashrc
echo 'export PATH=$PATH:$HOME/usr/bin' >> ~/.bashrc
cd ..
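To verify that Juman++ is usable from Python via pyknp, the following sketch assumes pyknp is installed from requirements.txt and that the updated PATH is active (e.g. after re-sourcing ~/.bashrc):

```python
# Minimal Juman++ check via pyknp (sketch; requires jumanpp on PATH).
from pyknp import Juman

juman = Juman("jumanpp")
result = juman.analysis("形態素解析の動作確認をします。")
print([mrph.midasi for mrph in result.mrph_list()])
```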
pip install sudachipy
pip install sudachidict_core
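You can confirm that SudachiPy and its core dictionary are installed correctly with the following check, which mirrors how the SudachiPreTokenizer below creates its tokenizer:

```python
# Minimal SudachiPy check (mirrors SudachiPreTokenizer in the script below).
from sudachipy import dictionary

sudachi = dictionary.Dictionary().create()
print([token.surface() for token in sudachi.tokenize("形態素解析の動作確認をします。")])
```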
See https://github.com/daac-tools/vaporetto for more details.
cd data/dict
wget https://github.com/daac-tools/vaporetto/releases/download/v0.5.0/bccwj-suw+unidic+tag.tar.xz
tar xf ./bccwj-suw+unidic+tag.tar.xz
cd ../..
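To check that the downloaded model loads, the sketch below mirrors the VaporettoPreTokenizer defined later in this README; the extracted model path is an assumption, and depending on your vaporetto version you may need to decompress the .zst file first:

```python
# Minimal Vaporetto check (sketch; the model path below is an assumption).
import vaporetto

with open("data/dict/bccwj-suw+unidic+tag/bccwj-suw+unidic+tag.model.zst", "rb") as fp:
    model = fp.read()
tokenizer = vaporetto.Vaporetto(model, predict_tags=False)
print([token.surface() for token in tokenizer.tokenize("形態素解析の動作確認をします。")])
```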
Please see preprocessing_for_tokenizers.
Please see tokenizer.
Please see preprocessing_for_pretraining.
Please see pretraining.
First, please clone the JGLUE repository and download the JGLUE dataset under ./data, following https://github.com/yahoojapan/JGLUE.
Please see marc-ja.
Please see jsts.
Please see jnli.
Please see jsquad.
Please see jcommonsenseqa.
Please see ner.
Please see dependency_parsing.
The pretrained weights are available on the Hugging Face Hub.
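The weights can be loaded with the standard Transformers API; the sketch below uses a placeholder repository ID, which should be replaced with the actual one published on the Hub:

```python
# Loading the pretrained weights (sketch; "<org>/<model-name>" is a placeholder
# for the actual repository ID on the Hugging Face Hub).
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("<org>/<model-name>")
```

Note that the corresponding tokenizer cannot be loaded this way; see the instructions below.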
The trained dictionary files are available from this repository.
| | BPE | Unigram | WordPiece |
|---|---|---|---|
| MeCab | mecab_bpe.json | mecab_unigram.json | mecab_wordpiece.json |
| Juman++ | jumanpp_bpe.json | jumanpp_unigram.json | jumanpp_wordpiece.json |
| Sudachi | sudachi_bpe.json | sudachi_unigram.json | sudachi_wordpiece.json |
| Vaporetto | vaporetto_bpe.json | vaporetto_unigram.json | vaporetto_wordpiece.json |
| Nothing | nothing_bpe.json | nothing_unigram.json | nothing_wordpiece.json |
Because we use customised tokenizers, AutoTokenizer.from_pretrained() cannot load these dictionary files. To load a dictionary file and construct a tokenizer, please use the following script and call build_tokenizer().
from typing import Optional
from tokenizers import Tokenizer
from tokenizers import NormalizedString, PreTokenizedString
from tokenizers.processors import BertProcessing
from tokenizers.pre_tokenizers import PreTokenizer
from transformers import PreTrainedTokenizerFast
from pyknp import Juman
from MeCab import Tagger
from sudachipy import tokenizer
from sudachipy import dictionary
import vaporetto
import mojimoji
import traceback
import textspan
class JumanPreTokenizer:
def __init__(self):
self.juman = Juman("jumanpp", multithreading=True)
def tokenize(self, sequence: str) -> list[str]:
text = mojimoji.han_to_zen(sequence).rstrip()
try:
result = self.juman.analysis(text)
except:
traceback.print_exc()
text = ""
result = self.juman.analysis(text)
return [mrph.midasi for mrph in result.mrph_list()]
def custom_split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
text = str(normalized_string)
tokens = self.tokenize(text)
tokens_spans = textspan.get_original_spans(tokens, text)
        return [normalized_string[st:ed] for char_spans in tokens_spans for st, ed in char_spans]
def pre_tokenize(self, pretok: PreTokenizedString):
pretok.split(self.custom_split)
class MecabPreTokenizer:
def __init__(self, mecab_dict_path: Optional[str] = None):
mecab_option = (f"-Owakati -d {mecab_dict_path}" if mecab_dict_path is not None else "-Owakati")
self.mecab = Tagger(mecab_option)
def tokenize(self, sequence: str) -> list[str]:
return self.mecab.parse(sequence).strip().split(" ")
def custom_split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
text = str(normalized_string)
tokens = self.tokenize(text)
tokens_spans = textspan.get_original_spans(tokens, text)
        return [normalized_string[st:ed] for char_spans in tokens_spans for st, ed in char_spans]
def pre_tokenize(self, pretok: PreTokenizedString):
pretok.split(self.custom_split)
class SudachiPreTokenizer:
def __init__(self, mecab_dict_path: Optional[str] = None):
self.sudachi = dictionary.Dictionary().create()
def tokenize(self, sequence: str) -> list[str]:
return [token.surface() for token in self.sudachi.tokenize(sequence)]
def custom_split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
text = str(normalized_string)
tokens = self.tokenize(text)
tokens_spans = textspan.get_original_spans(tokens, text)
        return [normalized_string[st:ed] for char_spans in tokens_spans for st, ed in char_spans]
def pre_tokenize(self, pretok: PreTokenizedString):
pretok.split(self.custom_split)
class VaporettoPreTokenizer:
def __init__(self, unidic_path: str):
with open(unidic_path, 'rb') as fp:
model = fp.read()
self.tokenizer = vaporetto.Vaporetto(model, predict_tags=False)
def tokenize(self, sequence: str) -> list[str]:
tokens = self.tokenizer.tokenize(sequence)
return [token.surface() for token in tokens]
def custom_split(self, i: int, normalized_string: NormalizedString) -> list[NormalizedString]:
text = str(normalized_string)
tokens = self.tokenize(text)
tokens_spans = textspan.get_original_spans(tokens, text)
        return [normalized_string[st:ed] for char_spans in tokens_spans for st, ed in char_spans]
def pre_tokenize(self, pretok: PreTokenizedString):
pretok.split(self.custom_split)
def build_tokenizer(
dict_path: str,
pretokenizer_type: str = None,
vaporetto_model_path: str = None
) -> PreTrainedTokenizerFast:
# load a tokenizer
tokenizer = Tokenizer.from_file(dict_path)
# load a pre-tokenizer
if pretokenizer_type == 'mecab':
pre_tokenizer = MecabPreTokenizer()
elif pretokenizer_type == 'jumanpp':
pre_tokenizer = JumanPreTokenizer()
elif pretokenizer_type == 'vaporetto':
pre_tokenizer = VaporettoPreTokenizer(vaporetto_model_path)
elif pretokenizer_type == 'sudachi':
pre_tokenizer = SudachiPreTokenizer()
elif pretokenizer_type == 'nothing':
pre_tokenizer = None
else:
raise NotImplementedError()
tokenizer.post_processor = BertProcessing(
cls=("[CLS]", tokenizer.token_to_id('[CLS]')),
sep=("[SEP]", tokenizer.token_to_id('[SEP]'))
)
# convert to PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(
tokenizer_object=tokenizer,
unk_token='[UNK]',
cls_token='[CLS]',
sep_token='[SEP]',
pad_token='[PAD]',
mask_token='[MASK]'
)
# set a pre-tokenizer
if pre_tokenizer is not None:
tokenizer._tokenizer.pre_tokenizer = PreTokenizer.custom(pre_tokenizer)
return tokenizer
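For example, a tokenizer can be built and used as follows (a sketch; the dictionary file and pre-tokenizer type are illustrative and should be chosen as a matching pair from the table above):

```python
# Example usage of build_tokenizer (sketch; the dictionary path is illustrative).
tokenizer = build_tokenizer(
    dict_path="mecab_bpe.json",  # one of the dictionary files listed above
    pretokenizer_type="mecab",   # must match the analyzer used to train the dictionary
)
encoded = tokenizer("日本語のテキストをトークナイズします。")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```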
If you use our code or findings, please cite our paper:

@inproceedings{fujii-etal-2023-different,
    title = "How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese",
    author = "Takuro Fujii and Koki Shibata and Atsuki Yamaguchi and Terufumi Morishita and Yasuhiro Sogawa",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics: Student Research Workshop",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
}
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License unless otherwise specified.