
KoBERT on 🤗 Huggingface Transformers 🤗 (with Bug Fixed)

Primary language: Python. License: Apache-2.0.

KoBERT-Transformers

⚠️ Deprecated (2024.08.21) ⚠️

  • Early versions of transformers (about four years ago) did not support Auto mapping with custom code, so this project was provided as a Python library.
  • The models can now be loaded directly through AutoModel and AutoTokenizer with the trust_remote_code=True option, so this library will no longer be maintained.
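
As a minimal sketch of the Auto-class loading mentioned in the deprecation note: the helper name `load_kobert` is hypothetical, and the snippet assumes a recent transformers release that accepts `trust_remote_code` plus network access to the Hugging Face Hub. Extra dependencies such as sentencepiece may also be needed.

```python
def load_kobert(model_name: str = "monologg/kobert"):
    """Load the model and tokenizer through the Auto classes.

    Hypothetical helper for illustration; calling it downloads files from the Hub.
    """
    # Imported inside the function so that defining the helper does not
    # itself require transformers to be installed.
    from transformers import AutoModel, AutoTokenizer

    # trust_remote_code=True allows the custom code published alongside
    # the checkpoint to be executed when the classes are resolved.
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    return model, tokenizer
```

Calling `model, tokenizer = load_kobert()` should then give the same objects that the examples further down obtain through this library. Since `trust_remote_code=True` runs code from the model repository, it is worth reviewing that code before enabling the option.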

KoBERT & DistilKoBERT on 🤗 Huggingface Transformers 🤗

The KoBERT model is identical to the one in the official repo. This repo was created to support the full Huggingface tokenizer API.

Changes

🙏 TL;DR

  1. Be sure to install transformers v3.0 or later!
  2. For the tokenizer, use kobert_transformers/tokenization_kobert.py from this repo!

1. Tokenizer compatibility

Starting with v2.9.0, Huggingface Transformers changed part of its tokenization API. The existing tokenization_kobert.py has been updated to match the newer versions.

2. The padding_idx issue in Embedding

BertEmbeddings in BertModel had long hard-coded padding_idx=0. (See the code below.)

class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

However, Sentencepiece defaults to pad_token_id=1 and unk_token_id=0 (the same is true of KoBERT), so a BertModel that uses these values as-is can produce unintended results.
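
One way to see the effect is a toy embedding, sketched below. The sizes (8 tokens, 4 dimensions) and the input ids are assumptions for illustration only; `padding_idx=0` mirrors the hard-coded value shown above. PyTorch never updates the row at `padding_idx`, so with these token ids it is the `[UNK]` row that stays frozen while `[PAD]` is trained like an ordinary token.

```python
import torch
from torch import nn

UNK_ID, PAD_ID = 0, 1  # Sentencepiece defaults described above

# Toy sizes chosen for illustration; padding_idx=0 mirrors the hard-coded value.
embedding = nn.Embedding(num_embeddings=8, embedding_dim=4, padding_idx=0)

# One short sequence containing a regular token, [UNK], and [PAD].
input_ids = torch.tensor([[5, UNK_ID, PAD_ID]])
embedding(input_ids).sum().backward()

grad = embedding.weight.grad
# The row at padding_idx receives no gradient, so [UNK] is never updated.
unk_row_frozen = bool(torch.all(grad[UNK_ID] == 0))
# [PAD] is not the padding_idx here, so it accumulates gradient.
pad_row_trained = bool(torch.any(grad[PAD_ID] != 0))
print(unk_row_frozen, pad_row_trained)
```

Passing `padding_idx=PAD_ID` instead swaps the two results, which is the behavior one would normally want.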

Huggingface recently recognized this issue as well and fixed it in v2.9.0. (Related PR #3793) The fix makes it possible to add pad_token_id=1 to the config, which resolves the problem.

class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

However, v2.9.0 did not fix the issue for DistilBERT, ALBERT, and other models, so I submitted a PR to address it (related PR #3965), and the fix was finally released in v2.9.1.

The code below shows the difference between the earlier and current versions.

# Transformers v2.7.0
>>> from transformers import BertModel, DistilBertModel
>>> model = BertModel.from_pretrained("monologg/kobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=0)
>>> model = DistilBertModel.from_pretrained("monologg/distilkobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=0)


# Transformers v2.9.1
>>> from transformers import BertModel, DistilBertModel
>>> model = BertModel.from_pretrained("monologg/kobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=1)
>>> model = DistilBertModel.from_pretrained("monologg/distilkobert")
>>> model.embeddings.word_embeddings
Embedding(8002, 768, padding_idx=1)

KoBERT / DistilKoBERT on 🤗 Transformers 🤗

Dependencies

  • torch>=1.1.0
  • transformers>=3,<5
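
The version range above can be checked before loading anything. The sketch below is a simplified stand-in for a real version-specifier check: the function names are hypothetical, and it only compares the major version, so pre-release suffixes are ignored.

```python
import re
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract the leading 'major.minor' numbers from a version string."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_supported_transformers(version: str) -> bool:
    """Return True when the version falls in the >=3,<5 range listed above."""
    major, _minor = parse_major_minor(version)
    return 3 <= major < 5


print(is_supported_transformers("3.0.2"))  # inside the range
print(is_supported_transformers("2.9.1"))  # older than v3.0
```

In a real environment the installed version string could come from `transformers.__version__`.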

How to Use

>>> from transformers import BertModel, DistilBertModel
>>> bert_model = BertModel.from_pretrained('monologg/kobert')
>>> distilbert_model = DistilBertModel.from_pretrained('monologg/distilkobert')

To use the tokenizer, copy the kobert_transformers/tokenization_kobert.py file and then import KoBertTokenizer.

  • KoBERT and DistilKoBERT both use the same tokenizer.
  • The original KoBERT had an issue where special tokens were not split correctly; this has been fixed here. (Issue link)

>>> from tokenization_kobert import KoBertTokenizer
>>> tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert')  # same for monologg/distilkobert
>>> tokenizer.tokenize("[CLS] 한국어 모델을 공유합니다. [SEP]")
['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]'])
[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]

Kobert-Transformers (Pip library)


  • A Python library that wraps tokenization_kobert.py
  • Provides KoBERT and DistilKoBERT in the form of a Huggingface Transformers library
  • From v0.5.1 onward, transformers v3.0 or later is installed by default. (It works without issues up to transformers v4.0.)

Install Kobert-Transformers

pip3 install kobert-transformers

How to Use

>>> import torch
>>> from kobert_transformers import get_kobert_model, get_distilkobert_model
>>> model = get_kobert_model()
>>> model.eval()
>>> input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
>>> attention_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
>>> token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
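>>> # Indexing works for both tuple outputs (transformers v3) and ModelOutput objects (v4)
>>> outputs = model(input_ids, attention_mask, token_type_ids)
>>> sequence_output, pooled_output = outputs[0], outputs[1]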
>>> sequence_output[0]
tensor([[-0.2461,  0.2428,  0.2590,  ..., -0.4861, -0.0731,  0.0756],
        [-0.2478,  0.2420,  0.2552,  ..., -0.4877, -0.0727,  0.0754],
        [-0.2472,  0.2420,  0.2561,  ..., -0.4874, -0.0733,  0.0765]],
       grad_fn=<SelectBackward>)
>>> from kobert_transformers import get_tokenizer
>>> tokenizer = get_tokenizer()
>>> tokenizer.tokenize("[CLS] 한국어 모델을 공유합니다. [SEP]")
['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '▁한국', '어', '▁모델', '을', '▁공유', '합니다', '.', '[SEP]'])
[2, 4958, 6855, 2046, 7088, 1050, 7843, 54, 3]
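
When batching token ids by hand, padding and the attention mask have to agree with KoBERT's pad id of 1 noted earlier. The sketch below shows one way to do that in plain Python; `pad_batch` is a hypothetical helper, it assumes a non-empty batch, and the token ids are borrowed from the tokenizer output above purely as sample data.

```python
from typing import List, Tuple

PAD_TOKEN_ID = 1  # KoBERT's pad id, per the Sentencepiece defaults noted in this README


def pad_batch(
    batch: List[List[int]], pad_id: int = PAD_TOKEN_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in batch)
    # Append pad ids so all rows share the same length.
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch([[2, 4958, 6855, 3], [2, 2046, 3]])
print(input_ids)
print(attention_mask)
```

The two lists can be wrapped in `torch.LongTensor` and passed to the model as in the example above. In practice the tokenizer's own padding options would normally take care of this step.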

Reference