BART (Bidirectional and Auto-Regressive Transformers) is trained as an autoencoder: noise is added to part of the input text, and the model learns to restore the original text. Korean BART (hereafter KoBART) is a Korean encoder-decoder language model trained on more than 40GB of Korean text, using the Text Infilling noise function from the paper. We release the resulting KoBART-base.
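To make the Text Infilling noise concrete, here is a minimal sketch in plain Python. It is not the KoBART training code: the `<mask>` string, the `text_infilling` and `sample_poisson` names, the 30% mask ratio, and the Poisson span length with λ = 3 are assumptions taken from the general description in the BART paper, and the settings actually used for KoBART are not stated here.

```python
import math
import random

MASK = "<mask>"  # assumed mask symbol; the real KoBART mask token may differ


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def text_infilling(tokens, mask_ratio=0.3, lam=3.0, seed=0):
    """Replace random spans with a single MASK each.

    At most round(len(tokens) * mask_ratio) tokens are removed in total.
    A sampled span length of 0 inserts a MASK without removing anything.
    """
    rng = random.Random(seed)
    budget = int(round(len(tokens) * mask_ratio))
    out = []
    i = 0
    while i < len(tokens):
        if budget > 0 and rng.random() < mask_ratio:
            span = min(sample_poisson(lam, rng), budget, len(tokens) - i)
            out.append(MASK)  # the whole span collapses into one mask token
            i += span
            budget -= span
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    words = "KoBART is a Korean encoder-decoder language model".split()
    print(text_infilling(words, seed=1))
```

Because each span becomes a single mask regardless of its length, the decoder has to work out how many tokens are missing as well as which ones. That is what distinguishes this noise from per-token masking.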
pip install git+https://github.com/ajb3296/KoBART.git
Data | # of Sentences |
---|---|
Korean Wiki | 5M |
Other corpus | 0.27B |
In addition to Korean Wikipedia, a variety of data was used to train the model, including news, books, 모두의 말뭉치 (Modu Corpus) v1.0 (dialogue, news, ...), and Blue House National Petitions.
The tokenizer was trained with the Character BPE tokenizer from the tokenizers package.
The vocab size is 30,000. Emoticons and emoji that are common in conversation, such as the ones below, were added to improve recognition of those tokens.
😀, 😁, 😆, 😅, 🤣, .. , :-), :), -), (-:...
In addition, unused tokens such as <unused0> ~ <unused99> are defined so that they can be freely assigned as needed for subtasks.
>>> from kobart import get_kobart_tokenizer
>>> kobart_tokenizer = get_kobart_tokenizer()
>>> kobart_tokenizer.tokenize("안녕하세요. 한국어 BART 입니다.🤣:)l^o")
['▁안녕하', '세요.', '▁한국어', '▁B', 'A', 'R', 'T', '▁입', '니다.', '🤣', ':)', 'l^o']
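The reserved <unused0> ~ <unused99> tokens can be bound to task-specific markers. The sketch below only shows the bookkeeping in plain Python. `assign_special_tokens` and `tag_input` are hypothetical helpers, not part of the kobart package, and it is an assumption that the tokenizer keeps each reserved token as a single piece.

```python
# Hypothetical helpers: bind task-specific markers to the reserved token strings.
UNUSED_TOKENS = [f"<unused{i}>" for i in range(100)]  # <unused0> ~ <unused99>


def assign_special_tokens(names):
    """Map each custom marker name to one reserved token, in order."""
    if len(names) > len(UNUSED_TOKENS):
        raise ValueError("only 100 reserved tokens are available")
    if len(set(names)) != len(names):
        raise ValueError("marker names must be unique")
    return dict(zip(names, UNUSED_TOKENS))


def tag_input(text, marker, mapping):
    """Prefix the text with the reserved token bound to `marker`."""
    return f"{mapping[marker]} {text}"


if __name__ == "__main__":
    mapping = assign_special_tokens(["summarize", "translate"])
    print(tag_input("KoBART example input", "translate", mapping))
```

Keeping the mapping in one place lets a fine-tuning script and its inference code agree on which reserved token stands for which task.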
Model | # of params | Type | # of layers | # of heads | ffn_dim | hidden_dims |
---|---|---|---|---|---|---|
KoBART-base | 124M | Encoder | 6 | 16 | 3072 | 768 |
| | | Decoder | 6 | 16 | 3072 | 768 |
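As a sanity check on the table, the parameter count can be estimated from the listed dimensions. This is a rough sketch, not an inspection of the released checkpoint. It assumes a standard BART layout: token embeddings shared between encoder and decoder, learned positional embeddings with 1026 positions on each side, biases on every linear layer, and a layer norm after each embedding. The function names are made up for illustration.

```python
def linear(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(d):
    """Scale and shift vectors."""
    return 2 * d


def attention(d):
    """Query, key, value, and output projections."""
    return 4 * linear(d, d)


def bart_param_count(vocab=30000, d=768, ffn=3072,
                     enc_layers=6, dec_layers=6, max_pos=1026):
    """Estimate the parameter count (max_pos=1026 is an assumption)."""
    feed_forward = linear(d, ffn) + linear(ffn, d)
    # Encoder layer: self-attention + feed-forward, each followed by a layer norm.
    enc_layer = attention(d) + feed_forward + 2 * layer_norm(d)
    # Decoder layer: adds cross-attention and its layer norm.
    dec_layer = 2 * attention(d) + feed_forward + 3 * layer_norm(d)
    embeddings = vocab * d                  # shared token embeddings
    positions = 2 * max_pos * d             # encoder + decoder positions
    embedding_norms = 2 * layer_norm(d)     # one per side
    return (enc_layers * enc_layer + dec_layers * dec_layer
            + embeddings + positions + embedding_norms)


if __name__ == "__main__":
    total = bart_param_count()
    print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands at roughly 124M, which matches the table. The number of heads does not appear in the calculation, because splitting the 768-dimensional projections into 16 heads does not change their size.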
>>> from transformers import BartModel
>>> from kobart import get_pytorch_kobart_model, get_kobart_tokenizer
>>> kobart_tokenizer = get_kobart_tokenizer()
>>> model = BartModel.from_pretrained(get_pytorch_kobart_model())
>>> inputs = kobart_tokenizer(['안녕하세요.'], return_tensors='pt')
>>> model(inputs['input_ids'])
Seq2SeqModelOutput(last_hidden_state=tensor([[[-0.4418, -4.3673, 3.2404, ..., 5.8832, 4.0629, 3.5540],
[-0.1316, -4.6446, 2.5955, ..., 6.0093, 2.7467, 3.0007]]],
grad_fn=<NativeLayerNormBackward>), past_key_values=((tensor([[[[-9.7980e-02, -6.6584e-01, -1.8089e+00, ..., 9.6023e-01, -1.8818e-01, -1.3252e+00],
Model | NSMC(acc) | KorSTS(spearman) | Question Pair(acc) |
---|---|---|---|
KoBART-base | 90.24 | 81.66 | 94.34 |
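For readers unfamiliar with the two metric types in the table, the sketch below computes accuracy and Spearman rank correlation in plain Python. The function names are illustrative and the inputs in the usage lines are toy values, not KoBART outputs. The scores in the table appear to be these quantities multiplied by 100, which is an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Toy values for illustration only.
    print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
    print(spearman([0.1, 0.4, 0.35, 0.8], [1.0, 3.0, 2.5, 5.0]))
```

Spearman correlation only compares orderings, so a similarity model is rewarded for ranking sentence pairs in the same order as the human scores even when its raw values are on a different scale.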
- To be updated
Demo: the example shown is the result of summarizing a ZDNET article.
- NSMC Classification
- KoBART ChitChatBot
- KoBART Summarization
- KoBART Translation
- LegalQA using SentenceKoBART
- KoBART Question Generation
If you have an interesting example that uses KoBART, please send a PR!
- v0.5.1
  - guide default 'import statements'
- v0.5
  - download large files from aws s3
- v0.4
  - Update model binary
- v0.3
  - Fixed an issue where the <unk> token disappeared because of a tokenizer bug
- v0.2
  - Updated the KoBART model (improved sample efficiency on subtasks)
  - Specified the version of 모두의 말뭉치 (Modu Corpus) used
  - Fixed a downloader bug
  - Added support for pip installation
Please report KoBART-related issues on the repository's issue tracker.
KoBART is released under the modified MIT license. If you use the model or code, please comply with the license terms. The full license text is available in the LICENSE file.