BART (Bidirectional and Auto-Regressive Transformers) is trained as an autoencoder: noise is added to part of the input text, and the model learns to reconstruct the original text. Korean BART (hereafter KoBART) is a Korean encoder-decoder language model trained on more than 40GB of Korean text using the Text Infilling noise function from the paper. We release the resulting KoBART-base model.
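The Text Infilling corruption can be sketched in a few lines of Python. This is an illustrative toy, not KoBART's actual pretraining code: span lengths are drawn uniformly from 1–3 here for simplicity, whereas the BART paper samples them from a Poisson(λ=3) distribution.

```python
import random

MASK = "<mask>"

def text_infilling(tokens, mask_prob=0.3, max_span=3, seed=0):
    """Toy BART-style Text Infilling: random spans of tokens are each
    replaced by a single <mask>. The model is then trained to
    reconstruct the original sequence from this corrupted input."""
    rng = random.Random(seed)
    out, i = [], 0
    while i < len(tokens):
        if rng.random() < mask_prob:
            out.append(MASK)                # the whole span collapses to one mask
            i += rng.randint(1, max_span)
        else:
            out.append(tokens[i])
            i += 1
    return out

print(text_infilling("the quick brown fox jumps over the lazy dog".split()))
```

Because a multi-token span is replaced by a single `<mask>`, the model must also learn how many tokens are missing, which distinguishes Text Infilling from ordinary token masking.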
git clone https://github.com/SKT-AI/KoBART.git
cd KoBART
pip install -r requirements.txt
pip install .
| Data | # of Sentences |
|---|---|
| Korean Wiki | 5M |
| Other corpus | 0.27B |
In addition to Korean Wikipedia, a variety of data was used to train the model, including news, books, the Modu Corpus (dialogue, news, ...), and Blue House national petitions.
The tokenizer was trained as a Character BPE tokenizer using the `tokenizers` package.
The vocab size is 30,000. Emoticons and emoji frequently used in dialogue, such as the following, were added to improve the model's ability to recognize those tokens:

😀, 😁, 😆, 😅, 🤣, ... , :-), :), -), (-:

In addition, the unused tokens `<unused0>` through `<unused99>` are reserved so they can be freely defined and used for whatever subtasks require them.
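One common way to use the reserved tokens is to map task-specific markers onto them before tokenization, so each marker stays a single known vocabulary item. A minimal sketch (the marker names and mapping below are hypothetical, not part of KoBART):

```python
# Hypothetical mapping of task-specific markers onto KoBART's reserved
# unused tokens; choose whatever markers your subtask needs.
SPECIAL_ALIASES = {
    "<title>": "<unused0>",
    "<sep>": "<unused1>",
}

def encode_markers(text, aliases=SPECIAL_ALIASES):
    """Replace custom markers with reserved <unusedN> tokens so the
    tokenizer treats each marker as one known vocabulary item."""
    for marker, unused in aliases.items():
        text = text.replace(marker, unused)
    return text

print(encode_markers("<title>Headline<sep>Body text"))
# → <unused0>Headline<unused1>Body text
```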
>>> from kobart import get_kobart_tokenizer
>>> kobart_tokenizer = get_kobart_tokenizer()
>>> kobart_tokenizer.tokenize("안녕하세요. 한국어 BART 입니다.🤣:)l^o")
['▁안녕하', '세요.', '▁한국어', '▁B', 'A', 'R', 'T', '▁입', '니다.', '🤣', ':)', 'l^o']
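In the tokenizer output, `▁` (U+2581) marks the start of a new word, SentencePiece-style. Detokenization is just concatenation plus turning `▁` back into spaces. A minimal sketch (real tokenizers expose this via methods such as `convert_tokens_to_string` and also strip special tokens):

```python
def detokenize(tokens):
    """Reverse SentencePiece-style tokenization: concatenate the pieces
    and convert the word-boundary marker U+2581 back into spaces."""
    return "".join(tokens).replace("\u2581", " ").strip()

tokens = ['▁안녕하', '세요.', '▁한국어', '▁B', 'A', 'R', 'T', '▁입', '니다.', '🤣', ':)', 'l^o']
print(detokenize(tokens))  # recovers the original input string
```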
| Model | Training | # of params | Type | # of layers | # of heads | ffn_dim | hidden_dims |
|---|---|---|---|---|---|---|---|
| KoBART-base | 3 weeks (P40 GPU x 48) | 124M | Encoder | 6 | 16 | 3072 | 768 |
| | | | Decoder | 6 | 16 | 3072 | 768 |
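The 124M figure can be roughly sanity-checked with back-of-the-envelope arithmetic. The sketch below counts weight matrices only (biases and LayerNorm parameters are omitted, and the position-embedding size of 1026 is an assumption), so it slightly undercounts:

```python
def bart_param_estimate(vocab=30_000, d=768, ffn=3072,
                        enc_layers=6, dec_layers=6, max_pos=1026):
    """Rough weight-only parameter count for a BART-style model."""
    embed = vocab * d + max_pos * d        # token + position embeddings (shared)
    attn = 4 * d * d                       # Q, K, V and output projections
    ffn_w = 2 * d * ffn                    # up- and down-projection
    enc = enc_layers * (attn + ffn_w)
    dec = dec_layers * (2 * attn + ffn_w)  # self-attention + cross-attention
    return embed + enc + dec

print(f"{bart_param_estimate() / 1e6:.1f}M")  # ≈ 123M, close to the reported 124M
```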
>>> from transformers import BartModel
>>> from kobart import get_pytorch_kobart_model, get_kobart_tokenizer
>>> kobart_tokenizer = get_kobart_tokenizer()
>>> model = BartModel.from_pretrained(get_pytorch_kobart_model())
>>> inputs = kobart_tokenizer(['안녕하세요.'], return_tensors='pt')
>>> model(inputs['input_ids'])
Seq2SeqModelOutput(last_hidden_state=tensor([[[-0.4488, -4.3651, 3.2349, ..., 5.8916, 4.0497, 3.5468],
[-0.4096, -4.6106, 2.7189, ..., 6.1745, 2.9832, 3.0930]]],
grad_fn=<TransposeBackward0>), past_key_values=None, decoder_hidden_states=None, decoder_attentions=None, cross_attentions=None, encoder_last_hidden_state=tensor([[[ 0.4624, -0.2475, 0.0902, ..., 0.1127, 0.6529, 0.2203],
[ 0.4538, -0.2948, 0.2556, ..., -0.0442, 0.6858, 0.4372]]],
grad_fn=<TransposeBackward0>), encoder_hidden_states=None, encoder_attentions=None)
| | NSMC (acc) | KorSTS (spearman) | Question Pair (acc) |
|---|---|---|---|
| KoBART-base | 90.07 | 81.31 | 93.80 |
To be updated.
If you have an interesting example that uses KoBART, please open a PR!

Please file KoBART-related issues here.

KoBART is released under a modified MIT license. Please comply with the license terms when using the model or the code. The full license text is available in the LICENSE file.