
🀣 KoBART

BART (Bidirectional and Auto-Regressive Transformers) is trained as an autoencoder: noise is added to parts of the input text, and the model learns to reconstruct the original. Korean BART (KoBART) is a Korean encoder-decoder language model trained on more than 40GB of Korean text using the Text Infilling noise function described in the paper. We release the resulting KoBART-base model.
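For intuition, Text Infilling corrupts the input by replacing a contiguous span of tokens with a single mask token. The sketch below is a simplification (BART samples multiple spans with Poisson-distributed lengths; here a single fixed-ratio span is masked):

```python
import random

def text_infilling(tokens, mask_token="<mask>", mask_ratio=0.3, seed=0):
    """Toy Text Infilling: replace one contiguous span with a single mask.
    BART samples several spans with Poisson-distributed lengths; this
    simplified sketch masks a single fixed-ratio span instead."""
    rng = random.Random(seed)
    span = max(1, int(len(tokens) * mask_ratio))
    start = rng.randrange(len(tokens) - span + 1)
    return tokens[:start] + [mask_token] + tokens[start + span:]

tokens = "the model learns to restore the corrupted text".split()
noised = text_infilling(tokens)
print(noised)  # the masked span collapses into one <mask> token
```

The model is then trained to map the noised sequence back to the original tokens.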

How to install

pip install git+https://github.com/SKT-AI/KoBART#egg=kobart

Data

| Data         | # of Sentences |
| ------------ | -------------- |
| Korean Wiki  | 5M             |
| Other corpus | 0.27B          |

In addition to Korean Wikipedia, a variety of data was used for training, including news, books, Modu Corpus v1.0 (dialogue, news, ...), and Blue House National Petition texts.

Tokenizer

The tokenizer was trained with the Character BPE tokenizer from the `tokenizers` package.
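For intuition, Character BPE starts from individual characters and repeatedly merges the most frequent adjacent pair. A minimal sketch of the merge procedure on a toy corpus (not the actual `tokenizers` implementation):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words as character tuples with frequencies.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(3):  # learn three merges
    words = merge_pair(words, most_frequent_pair(words))
print(list(words))  # frequent character pairs have fused into subwords
```

The real tokenizer runs this process over the full corpus until the 30,000-entry vocabulary is reached.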

vocab μ‚¬μ΄μ¦ˆλŠ” 30,000 이며 λŒ€ν™”μ— 자주 μ“°μ΄λŠ” μ•„λž˜μ™€ 같은 이λͺ¨ν‹°μ½˜, 이λͺ¨μ§€ 등을 μΆ”κ°€ν•˜μ—¬ ν•΄λ‹Ή ν† ν°μ˜ 인식 λŠ₯λ ₯을 μ˜¬λ ΈμŠ΅λ‹ˆλ‹€.

πŸ˜€, 😁, πŸ˜†, πŸ˜…, 🀣, .. , :-), :), -), (-:...

λ˜ν•œ <unused0> ~ <unused99>λ“±μ˜ λ―Έμ‚¬μš© 토큰을 μ •μ˜ν•΄ ν•„μš”ν•œ subtasks에 따라 자유둭게 μ •μ˜ν•΄ μ‚¬μš©ν•  수 있게 ν–ˆμŠ΅λ‹ˆλ‹€.

>>> from kobart import get_kobart_tokenizer
>>> kobart_tokenizer = get_kobart_tokenizer()
>>> kobart_tokenizer.tokenize("μ•ˆλ…•ν•˜μ„Έμš”. ν•œκ΅­μ–΄ BART μž…λ‹ˆλ‹€.🀣:)l^o")
['β–μ•ˆλ…•ν•˜', 'μ„Έμš”.', 'β–ν•œκ΅­μ–΄', '▁B', 'A', 'R', 'T', 'β–μž…', 'λ‹ˆλ‹€.', '🀣', ':)', 'l^o']

Model

| Model       | # of params | Type    | # of layers | # of heads | ffn_dim | hidden_dims |
| ----------- | ----------- | ------- | ----------- | ---------- | ------- | ----------- |
| KoBART-base | 124M        | Encoder | 6           | 16         | 3072    | 768         |
|             |             | Decoder | 6           | 16         | 3072    | 768         |

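The 124M figure can be sanity-checked with a back-of-the-envelope count from the table above (biases, layer norms, and positional embeddings are omitted, so the estimate comes out slightly low):

```python
# Rough parameter-count estimate for a BART-base-style encoder-decoder.
VOCAB, D, FFN, ENC, DEC = 30000, 768, 3072, 6, 6

embed = VOCAB * D            # shared token embedding table
attn  = 4 * D * D            # Q, K, V, and output projections
ffn   = 2 * D * FFN          # two feed-forward projections
enc_layer = attn + ffn       # self-attention + FFN
dec_layer = 2 * attn + ffn   # self- + cross-attention + FFN

total = embed + ENC * enc_layer + DEC * dec_layer
print(f"~{total / 1e6:.0f}M parameters")  # ~122M, close to the reported 124M
```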
>>> from transformers import BartModel
>>> from kobart import get_pytorch_kobart_model, get_kobart_tokenizer
>>> kobart_tokenizer = get_kobart_tokenizer()
>>> model = BartModel.from_pretrained(get_pytorch_kobart_model())
>>> inputs = kobart_tokenizer(['μ•ˆλ…•ν•˜μ„Έμš”.'], return_tensors='pt')
>>> model(inputs['input_ids'])
Seq2SeqModelOutput(last_hidden_state=tensor([[[-0.4418, -4.3673,  3.2404,  ...,  5.8832,  4.0629,  3.5540],
         [-0.1316, -4.6446,  2.5955,  ...,  6.0093,  2.7467,  3.0007]]],
       grad_fn=<NativeLayerNormBackward>), past_key_values=((tensor([[[[-9.7980e-02, -6.6584e-01, -1.8089e+00,  ...,  9.6023e-01, -1.8818e-01, -1.3252e+00],

Performances

Classification or Regression

|             | NSMC (acc) | KorSTS (spearman) | Question Pair (acc) |
| ----------- | ---------- | ----------------- | ------------------- |
| KoBART-base | 90.24      | 81.66             | 94.34               |

Summarization

μ—…λ°μ΄νŠΈ μ˜ˆμ •

Demos

The example above summarizes a ZDNET article.

Examples

If you have an interesting example that uses KoBART, please send a PR!

Contacts

Please file KoBART-related issues here.

Changes

  • V0.3
    • Fixed an issue where the <unk> token disappeared due to a tokenizer bug
  • V0.2
    • Updated the KoBART model (improved sample efficiency on subtasks)
    • Specified the version of the Modu Corpus used
    • Fixed a downloader bug
    • Added pip install support

License

KoBART is released under a modified MIT license. Please comply with the license terms when using the model or code. The full license text is available in the LICENSE file.