The GoEmotions dataset translated into Korean, then used to fine-tune KoELECTRA.

A dataset of 58,000 Reddit comments labeled with 28 emotions:
- admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise, neutral
- torch==1.4.0
- transformers==2.9.1
- googletrans==2.4.1
- attrdict==2.0.1
```bash
$ pip3 install -r requirements.txt
```
🚨 Because the dataset is built from Reddit comments, the quality of the translated output is not good. 🚨
- Generated the Korean data using pygoogletrans.
  - Since pygoogletrans v2.4.1 has not been published to PyPI, installing the library directly from its repository is recommended (as specified in `requirements.txt`).
- A 1.5-second interval was added between API calls.
- Since a single request can contain at most 5,000 characters, sentences were concatenated with `\r\n` and sent as one input.
- A zero-width space (`\u200b`) at the start of a sentence caused the translation to fail, so it was removed beforehand.
- The translated data is already in the `data` directory. If you want to run the translation yourself, execute the commands below.
```bash
$ bash download_original_data.sh
$ pip3 install git+git://github.com/ssut/py-googletrans
$ python3 tranlate_data.py
```
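The batching, cleanup, and rate-limiting steps above can be sketched roughly as follows. `clean` and `build_batches` are hypothetical helper names, not the actual functions in the translation script, and the final `split("\r\n")` assumes the translator preserves the separators, which is worth verifying:

```python
import time

MAX_CHARS = 5000  # Google Translate rejects requests longer than this


def clean(sentence: str) -> str:
    # A leading zero-width space (U+200B) silently breaks translation,
    # so strip it (and any other occurrences) before batching.
    return sentence.replace("\u200b", "")


def build_batches(sentences):
    """Join sentences with '\r\n' into chunks of at most MAX_CHARS."""
    batches, current, size = [], [], 0
    for sent in map(clean, sentences):
        # +2 approximately accounts for the '\r\n' separator
        if current and size + len(sent) + 2 > MAX_CHARS:
            batches.append("\r\n".join(current))
            current, size = [], 0
        current.append(sent)
        size += len(sent) + 2
    if current:
        batches.append("\r\n".join(current))
    return batches


def translate_all(sentences, translator):
    """translator is e.g. googletrans.Translator(); requires network access."""
    results = []
    for batch in build_batches(sentences):
        translated = translator.translate(batch, src="en", dest="ko").text
        results.extend(translated.split("\r\n"))
        time.sleep(1.5)  # keep 1.5 s between API calls to avoid throttling
    return results
```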
- The dataset contains the special tokens `[NAME]` and `[RELIGION]`, which were mapped to `[unused0]` and `[unused1]` in `vocab.txt`, respectively.
  - As of transformers v2.9.1, there is an issue where these two tokens are not handled even when added to `additional_special_tokens`, so they must be added directly in code rather than through the config. (See the Pipeline code.)
- Multi-label classification with a sigmoid applied per label (threshold set to 0.3).
  - See `ElectraForMultiLabelClassification` in `model.py`.
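The decision rule here is: apply a sigmoid to each of the 28 logits independently and emit every label whose probability clears the threshold, so a comment can carry several emotions at once (unlike softmax, the scores need not sum to 1). A minimal sketch of this rule, with illustrative label names and logits:

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, label_names, threshold=0.3):
    """Independent per-label sigmoid + threshold (multi-label)."""
    probs = [sigmoid(z) for z in logits]
    return [(name, p) for name, p in zip(label_names, probs) if p >= threshold]


# Toy example over three of the 28 labels: two probabilities clear 0.3
picked = predict_labels([2.0, -2.0, 0.0], ["joy", "grief", "neutral"])
```

This is why the pipeline output below can return two labels (e.g. `anger` and `annoyance`) for a single sentence.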
- For the config, edit the json files in the `config` directory.
```bash
$ python3 run_goemotions.py --config_file koelectra-base.json
$ python3 run_goemotions.py --config_file koelectra-small.json
```
Results are measured by Macro F1 (best result).

| Macro F1 (%) | Dev | Test |
| --- | --- | --- |
| KoELECTRA-Small | 36.92 | 37.87 |
| KoELECTRA-Base | 40.34 | 41.54 |
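Macro F1 computes F1 for each label separately and averages them with equal weight, so rare emotions (e.g. grief) count as much as frequent ones (e.g. neutral). A self-contained sketch for multi-label indicator vectors:

```python
def macro_f1(y_true, y_pred, n_labels):
    """y_true / y_pred: lists of 0/1 indicator vectors, one per example."""
    f1_per_label = []
    for k in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 0 and p[k] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 0)
        denom = 2 * tp + fp + fn
        # Per-label F1 = 2*TP / (2*TP + FP + FN); defined as 0 when empty
        f1_per_label.append(2 * tp / denom if denom else 0.0)
    return sum(f1_per_label) / n_labels
```

On indicator matrices this matches scikit-learn's `f1_score(y_true, y_pred, average="macro")`.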
- A new `MultiLabelPipeline` class was written to enable inference for multi-label classification.
- The `monologg/koelectra-base-finetuned-goemotions` and `monologg/koelectra-small-finetuned-goemotions` models were uploaded to Huggingface s3.
```python
from multilabel_pipeline import MultiLabelPipeline
from transformers import ElectraTokenizer
from model import ElectraForMultiLabelClassification
from pprint import pprint

tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-finetuned-goemotions")
tokenizer.add_special_tokens({"additional_special_tokens": ["[NAME]", "[RELIGION]"]})  # BUG: It should be hard-coded on transformers v2.9.1
model = ElectraForMultiLabelClassification.from_pretrained("monologg/koelectra-base-finetuned-goemotions")

goemotions = MultiLabelPipeline(
    model=model,
    tokenizer=tokenizer,
    threshold=0.3
)

texts = [
    "전혀 재미 있지 않습니다 ...",
    "나는 “지금 가장 큰 두려움은 내 자신 안에 사는 것”이라고 말했다.",
    "곱창... 한시간반 기다릴 맛은 아님!",
    "적절한 공간을 적절한 사람들로 채운다",
    "너무 좋아",
    "딥러닝을 짝사랑중인 학생입니다!",
    "마음이 급해진다.",
    "아니 진짜 다들 미쳤나봨ㅋㅋㅋ",
    "개노잼"
]

pprint(goemotions(texts))
```
```python
# Output
[{'labels': ['disapproval'], 'scores': [0.82489157]},
 {'labels': ['fear'], 'scores': [0.9509703]},
 {'labels': ['neutral'], 'scores': [0.9585297]},
 {'labels': ['approval', 'neutral'], 'scores': [0.62351847, 0.34225133]},
 {'labels': ['admiration'], 'scores': [0.97146636]},
 {'labels': ['love', 'neutral'], 'scores': [0.32616842, 0.5455638]},
 {'labels': ['caring', 'nervousness'], 'scores': [0.51289016, 0.4741806]},
 {'labels': ['amusement'], 'scores': [0.9680228]},
 {'labels': ['anger', 'annoyance'], 'scores': [0.5345557, 0.764603]}]
```