/GoEmotions-Korean

Korean version of GoEmotions Dataset 😍😒😱

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

GoEmotions-Korean

The GoEmotions dataset translated into Korean and used to fine-tune KoELECTRA.

GoEmotions

A dataset of 58,000 Reddit comments labeled with 28 emotion categories (27 emotions plus neutral):

  • admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise, neutral

Requirements

  • torch==1.4.0
  • transformers==2.9.1
  • googletrans==2.4.1
  • attrdict==2.0.1
$ pip3 install -r requirements.txt

Translated Data

🚨 Because the data comes from Reddit comments, the quality of the translated output is poor. 🚨

  • The Korean data was generated with pygoogletrans.
    • pygoogletrans v2.4.1 has not been published to PyPI, so installing the library directly from its repository is recommended (this is specified in requirements.txt).
  • A 1.5-second interval was left between API calls.
    • Since a single request can hold at most 5,000 characters, sentences were joined with \r\n and sent as one input.
  • Sentences containing a zero-width space failed to translate, so these characters were removed.
  • The translated data is already in the data directory. To run the translation yourself, execute the commands below.
$ bash download_original_data.sh
$ pip3 install git+https://github.com/ssut/py-googletrans
$ python3 tranlate_data.py
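
The batching described above can be sketched roughly as follows. This is not the repository's tranlate_data.py. The function names and the translate_fn callable are hypothetical stand-ins for the real translation call, and the sketch assumes the translator preserves the \r\n separators in its output.

```python
import time
from typing import Callable, Iterable, List

ZERO_WIDTH_SPACE = "\u200b"  # U+200B, removed before translation
MAX_CHARS = 5000             # per-request character limit mentioned above
SEPARATOR = "\r\n"           # sentences are joined with this separator
DELAY_SECONDS = 1.5          # pause between API calls


def make_batches(sentences: Iterable[str], max_chars: int = MAX_CHARS) -> List[List[str]]:
    """Group sentences so that each joined batch stays within max_chars.

    A single sentence longer than max_chars still gets its own batch;
    handling that case is left out of this sketch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        sentence = sentence.replace(ZERO_WIDTH_SPACE, "")
        # Cost of adding this sentence, including a separator if needed.
        extra = len(sentence) + (len(SEPARATOR) if current else 0)
        if current and current_len + extra > max_chars:
            batches.append(current)
            current, current_len = [], 0
            extra = len(sentence)
        current.append(sentence)
        current_len += extra
    if current:
        batches.append(current)
    return batches


def translate_all(
    sentences: Iterable[str],
    translate_fn: Callable[[str], str],  # hypothetical: joined source text in, joined translation out
    delay: float = DELAY_SECONDS,
) -> List[str]:
    """Translate batch by batch, sleeping between calls."""
    results: List[str] = []
    for i, batch in enumerate(make_batches(sentences)):
        if i > 0:
            time.sleep(delay)
        translated = translate_fn(SEPARATOR.join(batch))
        # Assumes the separators survive translation unchanged.
        results.extend(translated.split(SEPARATOR))
    return results
```

Joining sentences keeps the number of requests low, at the cost of relying on the separator to split the results back into sentences.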

Tokenizer

  • The dataset contains the special tokens [NAME] and [RELIGION]; these were assigned to [unused0] and [unused1] in vocab.txt, respectively.
  • As of transformers v2.9.1, adding these two tokens to additional_special_tokens has no effect, so they must be added directly in code rather than through the config (see the Pipeline code).
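
As an illustration of the vocabulary assignment, here is a minimal sketch that swaps placeholder entries for the special tokens while keeping every index unchanged. The helper name assign_special_tokens and the toy vocabulary are assumptions for this example and are not taken from the repository.

```python
from typing import Dict, List


def assign_special_tokens(vocab: List[str], replacements: Dict[str, str]) -> List[str]:
    """Return a copy of vocab with placeholder entries replaced.

    Each entry keeps its position, so token ids and the size of the
    embedding matrix are unaffected.
    """
    missing = set(replacements) - set(vocab)
    if missing:
        raise ValueError(f"placeholders not found in vocab: {sorted(missing)}")
    return [replacements.get(token, token) for token in vocab]


# Toy vocabulary for illustration only; the real vocab.txt is much larger.
vocab = ["[PAD]", "[unused0]", "[unused1]", "[UNK]"]
new_vocab = assign_special_tokens(
    vocab, {"[unused0]": "[NAME]", "[unused1]": "[RELIGION]"}
)
print(new_vocab)
```

Reusing [unused] slots avoids resizing the model's embeddings, which would be needed if the tokens were appended to the vocabulary instead.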

Train & Evaluation

  • Multi-label classification with a sigmoid applied to the outputs (the threshold is set to 0.3)
    • See ElectraForMultiLabelClassification in model.py
  • Settings can be changed in the JSON files in the config directory.
$ python3 run_goemotions.py --config_file koelectra-base.json
$ python3 run_goemotions.py --config_file koelectra-small.json
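
The sigmoid-plus-threshold decision rule can be sketched in plain Python as below. This is not the code in model.py. The function names are hypothetical, and whether the comparison is strict (>) or inclusive (>=) is an assumption.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_labels(
    logits: Sequence[float],
    label_names: Sequence[str],
    threshold: float = 0.3,
) -> List[Tuple[str, float]]:
    """Keep every label whose independent probability exceeds the threshold.

    Unlike softmax, each label is scored on its own, so zero, one,
    or several labels can be returned for a single input.
    """
    probs = [sigmoid(z) for z in logits]
    return [(name, p) for name, p in zip(label_names, probs) if p > threshold]


# Made-up logits for three of the 28 labels.
print(predict_labels([2.0, -2.0, 0.0], ["joy", "anger", "neutral"]))
```

A threshold below 0.5 lets lower-confidence labels through, which generally trades some precision for recall.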

Results

Results are measured by Macro F1 (best result).

Macro F1 (%)       Dev     Test
KoELECTRA-Small    36.92   37.87
KoELECTRA-Base     40.34   41.54
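
For reference, Macro F1 is the unweighted mean of the per-label F1 scores. The sketch below computes it for multi-label indicator vectors. It is not the repository's evaluation code, and scoring a label with no true or predicted positives as 0.0 is an assumption of this example.

```python
from typing import List


def macro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Unweighted mean of per-label F1 over 0/1 indicator rows."""
    num_labels = len(y_true[0])
    scores = []
    for j in range(num_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); 0.0 when the label never occurs (assumption).
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_labels


# Toy example with two labels and three samples.
y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 1]]
print(round(macro_f1(y_true, y_pred), 4))
```

Because every label counts equally, rare emotions affect Macro F1 as much as frequent ones.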

Pipeline

  • A new MultiLabelPipeline class was written to support inference for multi-label classification.
  • The models monologg/koelectra-base-finetuned-goemotions and monologg/koelectra-small-finetuned-goemotions were uploaded to Hugging Face (S3).
from multilabel_pipeline import MultiLabelPipeline
from transformers import ElectraTokenizer
from model import ElectraForMultiLabelClassification
from pprint import pprint


tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-finetuned-goemotions")
tokenizer.add_special_tokens({"additional_special_tokens": ["[NAME]", "[RELIGION]"]})  # Workaround: on transformers v2.9.1 the special tokens must be added in code
model = ElectraForMultiLabelClassification.from_pretrained("monologg/koelectra-base-finetuned-goemotions")

goemotions = MultiLabelPipeline(
    model=model,
    tokenizer=tokenizer,
    threshold=0.3
)

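texts = [
    "전혀 재미 있지 않습니다 ...",  # "It's not fun at all ..."
    "나는 “지금 가장 큰 두려움은 내 상자 안에 사는 것” 이라고 말했다.",  # 'I said, "My biggest fear right now is living inside my box."'
    "곱창... 한시간반 기다릴 맛은 아님!",  # "Gopchang... not worth an hour-and-a-half wait!"
    "애정하는 공간을 애정하는 사람들로 채울때",  # "When you fill a space you love with people you love"
    "너무 좋아",  # "I love it so much"
    "딥러닝을 짝사랑중인 학생입니다!",  # "I'm a student with a one-sided crush on deep learning!"
    "마음이 급해진다.",  # "I'm starting to feel rushed."
    "아니 진짜 다들 미쳤나봨ㅋㅋㅋ",  # "Seriously, everyone must have gone crazy lol"
    "개노잼"  # slang: "So boring"
]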

pprint(goemotions(texts))

# Output
[{'labels': ['disapproval'], 'scores': [0.82489157]},
 {'labels': ['fear'], 'scores': [0.9509703]},
 {'labels': ['neutral'], 'scores': [0.9585297]},
 {'labels': ['approval', 'neutral'], 'scores': [0.62351847, 0.34225133]},
 {'labels': ['admiration'], 'scores': [0.97146636]},
 {'labels': ['love', 'neutral'], 'scores': [0.32616842, 0.5455638]},
 {'labels': ['caring', 'nervousness'], 'scores': [0.51289016, 0.4741806]},
 {'labels': ['amusement'], 'scores': [0.9680228]},
 {'labels': ['anger', 'annoyance'], 'scores': [0.5345557, 0.764603]}]

Reference