HanBert-Transformers

HanBert on πŸ€— Huggingface Transformers πŸ€—

Details

  • Converted the HanBert Tensorflow ckpt to PyTorch
    • The Optimizer-related parameters were removed, shrinking the checkpoint from the original 1.43GB to 488MB.
    • The conversion had an issue where the Optimizer-related parameters were not skipped, so that part was patched before converting. (PR related to this issue) A sketch of the skipping idea follows the command below.
# transformers bert TF_CHECKPOINT TF_CONFIG PYTORCH_DUMP_OUTPUT
$ transformers bert HanBert-54kN/model.ckpt-3000000 \
                    HanBert-54kN/bert_config.json \
                    HanBert-54kN/pytorch_model.bin
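
The skipping itself happens inside the checkpoint-conversion code; the sketch below only illustrates the idea of filtering out optimizer slots when reading the TF checkpoint. The keyword list and the standalone loop are assumptions for illustration, not the actual patch from the PR.

import tensorflow as tf

ckpt_path = "HanBert-54kN/model.ckpt-3000000"
# Hypothetical skip list: optimizer state that is not needed for inference
skip_keywords = ("adam_m", "adam_v", "global_step")

for name, shape in tf.train.list_variables(ckpt_path):
    if any(keyword in name for keyword in skip_keywords):
        continue  # drop optimizer slots instead of converting them
    array = tf.train.load_variable(ckpt_path, name)  # weight that will be converted
    print(name, shape)
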
  • Wrote a new tokenization_hanbert.py file for the Tokenizer
    • Supports the tokenization-related functions of Transformers (convert_tokens_to_ids, convert_tokens_to_string, encode_plus...)

How to Use

  1. Install the required libraries (see the command after this list)

    • torch>=1.1.0
    • transformers>=2.2.2
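
For example, they can be installed with pip (one possible command; adjust it to your own environment):

$ pip3 install "torch>=1.1.0" "transformers>=2.2.2"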
  2. λͺ¨λΈ λ‹€μš΄λ‘œλ“œ ν›„ μ••μΆ• ν•΄μ œ

    • The original HanBert required the tokenization-related files to be copied to /usr/local/moran, but when this folder is used they work without that step.
    • Download link (Pretrained weight + Tokenizer tool)
  3. Prepare tokenization_hanbert.py

    • The Tokenizer can only be used in an Ubuntu environment.
    • The directory must be laid out as shown below (a small sanity-check snippet follows the tree).
.
β”œβ”€β”€ ...
β”œβ”€β”€ HanBert-54kN-torch
β”‚   β”œβ”€β”€ config.json
β”‚   β”œβ”€β”€ pytorch_model.bin
β”‚   β”œβ”€β”€ vocab_54k.txt
β”‚   β”œβ”€β”€ libmoran4dnlp.so
β”‚   β”œβ”€β”€ moran.db
β”‚   β”œβ”€β”€ udict.txt
β”‚   └── uentity.txt
β”œβ”€β”€ tokenization_hanbert.py
└── ...
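
As a quick sanity check before loading the model, you can verify that the folder contains every file from the tree above. This snippet is only illustrative and assumes the directory name used in this README.

import os

model_dir = "HanBert-54kN-torch"
required = ["config.json", "pytorch_model.bin", "vocab_54k.txt",
            "libmoran4dnlp.so", "moran.db", "udict.txt", "uentity.txt"]
missing = [name for name in required
           if not os.path.exists(os.path.join(model_dir, name))]
print("Missing files:", missing or "none")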

Example

1. Model

>>> import torch
>>> from transformers import BertModel

>>> model = BertModel.from_pretrained('HanBert-54kN-torch')
>>> input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
>>> token_type_ids = torch.LongTensor([[0, 0, 0], [0, 0, 0]])
>>> attention_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
>>> sequence_output, pooled_output = model(input_ids, attention_mask, token_type_ids)
>>> sequence_output
tensor([[[-0.0938, -0.5030,  0.3765,  ..., -0.4880, -0.4486,  0.3600],
         [-0.6036, -0.1008, -0.2344,  ..., -0.6606, -0.5762,  0.1021],
         [-0.4353,  0.0970, -0.0781,  ..., -0.7686, -0.4418,  0.4109]],

        [[-0.7117,  0.2479, -0.8164,  ...,  0.1509,  0.8337,  0.4054],
         [-0.7867, -0.0443, -0.8754,  ...,  0.0952,  0.5044,  0.5125],
         [-0.8613,  0.0138, -0.9315,  ...,  0.1651,  0.6647,  0.5509]]],
       grad_fn=<AddcmulBackward>)
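
For a sentence-level task such as NSMC, the pooled_output can be fed to a small classification head. The head below is purely illustrative (the linear layer and the number of classes are assumptions, not part of this repository):

>>> import torch.nn as nn
>>> classifier = nn.Linear(model.config.hidden_size, 2)  # hypothetical 2-class head
>>> logits = classifier(pooled_output)
>>> logits.shape
torch.Size([2, 2])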

2. Tokenizer

>>> from tokenization_hanbert import HanBertTokenizer
>>> tokenizer = HanBertTokenizer.from_pretrained('HanBert-54kN-torch')
>>> text = "λ‚˜λŠ” κ±Έμ–΄κ°€κ³  μžˆλŠ” μ€‘μž…λ‹ˆλ‹€. λ‚˜λŠ”κ±Έμ–΄ κ°€κ³ μžˆλŠ” μ€‘μž…λ‹ˆλ‹€. 잘 λΆ„λ₯˜λ˜κΈ°λ„ ν•œλ‹€. 잘 먹기도 ν•œλ‹€."
>>> tokenizer.tokenize(text)
['λ‚˜', '~~λŠ”', 'κ±Έμ–΄κ°€', '~~κ³ ', '있', '~~λŠ”', '쀑', '~~μž…', '~~λ‹ˆλ‹€', '.', 'λ‚˜', '##λŠ”κ±Έ', '##μ–΄', 'κ°€', '~~κ³ ', '~있', '~~λŠ”', '쀑', '~~μž…', '~~λ‹ˆλ‹€', '.', '잘', 'λΆ„λ₯˜', '~~되', '~~κΈ°', '~~도', 'ν•œ', '~~λ‹€', '.', '잘', 'λ¨Ή', '~~κΈ°', '~~도', 'ν•œ', '~~λ‹€', '.']
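
The other helper functions mentioned in the Details section work the same way as in any Transformers tokenizer. A minimal sketch (outputs are omitted because the ids depend on the vocabulary, and the keys returned by encode_plus depend on the transformers version):

>>> tokens = tokenizer.tokenize(text)
>>> input_ids = tokenizer.convert_tokens_to_ids(tokens)    # ids from vocab_54k.txt
>>> restored = tokenizer.convert_tokens_to_string(tokens)  # roughly reconstructs the text
>>> encoded = tokenizer.encode_plus(text)                   # dict of model inputs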

3. Test with a Python file

$ python3 test_hanbert.py --model_name_or_path HanBert-54kN-torch
$ python3 test_hanbert.py --model_name_or_path HanBert-54kN-IP-torch

Result on Subtask

max_seq_len was set to 50.

Model               NSMC (acc)   Naver-NER (F1)
HanBert-54kN          90.16          87.31
HanBert-54kN-IP       88.72          86.57
KoBERT                89.63          86.11
Bert-multilingual     87.07          84.20

Reference