/final-project-level3-nlp-06

final-project-level2-nlp-06 created by GitHub Classroom

Primary language: Python

😁 HAPPY 😁: HAte sPeech Purification for You

demo_resize_final

  • When a user writes a comment, a classification model first determines whether it contains hate speech.
  • If the comment is classified as hate speech, a token classification model locates which parts of the sentence are hateful and points them out.
  • A generation model then produces a purified version of the sentence, showing the user how the comment could be rewritten.

Malicious Comment Classification and Purified Regeneration Project

  • Using deep learning, we classify whether a comment is hateful and, if it is judged to be hate speech, regenerate a sentence that preserves its meaning.
  • This process is intended to raise users' awareness of the problem and encourage them to improve their comments voluntarily.

Table of Contents

  1. Team Members
  2. Service Architecture
  3. Model Structure
  4. Data
  5. Additional Information

Team Members

  • 김준휘: Classification Model, Classification API, Data Collecting
  • 류재환: Generation Model, Generation API, Data Collecting
  • 박수현: Classification Model, Data Guideline, Data Collecting, Data Checking
  • 박승현: Generation Model, Database, BackEnd, FrontEnd, Data Web, Data Collecting
  • 설유민: Generation Model, Data Collecting, Data Checking

Service Architecture

service_architecture

Model Structure

CLASSIFICATION MODEL

classification_model_architecture

  • As the backbone model, we used 🤗 beomi/KcElectra-base-v2022, which achieved the highest F1 score while keeping inference time reasonable.
    • F1 score: 90.88
    • RPS: 173
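
For context, RPS is read here as requests per second: the number of inference requests handled divided by the elapsed wall-clock time. The sketch below shows one simple way such a figure could be measured. It assumes sequential requests and uses a hypothetical `dummy_predict` function in place of the real model; the source does not describe how the reported value was obtained.

```python
import time
from typing import Callable, Sequence


def measure_rps(predict: Callable[[str], object], requests: Sequence[str]) -> float:
    """Return requests per second for sequential calls to `predict`."""
    if not requests:
        raise ValueError("requests must not be empty")
    start = time.perf_counter()
    for text in requests:
        predict(text)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return len(requests) / max(elapsed, 1e-9)


def dummy_predict(text: str) -> int:
    # Hypothetical stand-in for model inference.
    return len(text)
```

A production measurement would normally send concurrent requests to the serving API, so a sequential loop like this gives only a rough lower bound.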

GENERATION MODEL

generation_model_architecture

  • We ultimately adopted the Reward + Prompt model.

데이터

CLASSIFICATION MODEL

  • 혐였 λ¬Έμž₯ λΆ„λ₯˜ λͺ¨λΈμ˜ ν•™μŠ΅μ—λŠ” ν•œκ΅­μ–΄ λ‰΄μŠ€κΈ°μ‚¬ λŒ“κΈ€μ—μ„œ μˆ˜μ§‘ν•œ ν˜μ˜€ν‘œν˜„ 데이터셋인 K-MHaSλ₯Ό μ‚¬μš©ν–ˆμŠ΅λ‹ˆλ‹€.
  • ν˜μ˜€ν‘œν˜„ 토큰 λΆ„λ₯˜ λͺ¨λΈμ˜ ν•™μŠ΅μ—λŠ” 넀이버 λ‰΄μŠ€μ™€ 유튜브 μ˜μƒ λŒ“κΈ€μ—μ„œ μˆ˜μ§‘ν•œ ν•œκ΅­μ–΄ ν˜μ˜€ν‘œν˜„ 데이터셋인 KOLDλ₯Ό μ‚¬μš©ν–ˆμŠ΅λ‹ˆλ‹€.

GENERATION MODEL: Building the Parallel Dataset

  • To train a model that regenerates sentences with the hate speech removed but the meaning preserved, we built a parallel dataset of hate expression - purified expression pairs (3,133 in total) from direct user contributions.
  • The hate expressions were taken from the APEACH, BEEP!, K-MHaS, and KOLD datasets.
  • The dataset is available as the file hate_purificate_parallel_dataset.csv.
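
A parallel dataset of this kind can be loaded as (source, target) pairs with the standard library. The sketch below is an illustration under stated assumptions: the column names `hate` and `purified` are hypothetical, since the file's schema is not given here, and the in-memory sample rows are placeholders rather than real data.

```python
import csv
import io
from typing import Iterable, List, Tuple


def load_pairs(
    lines: Iterable[str],
    src_col: str = "hate",       # hypothetical column name
    tgt_col: str = "purified",   # hypothetical column name
) -> List[Tuple[str, str]]:
    """Read (original, purified) pairs, skipping rows with a missing side."""
    pairs = []
    for row in csv.DictReader(lines):
        src = (row.get(src_col) or "").strip()
        tgt = (row.get(tgt_col) or "").strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# Placeholder rows standing in for the real file contents.
sample = io.StringIO(
    "hate,purified\n"
    "original text A,rewritten text A\n"
    ",missing source\n"
)
pairs = load_pairs(sample)
```

With the real file, the same function could be passed an open file handle, for example `open("hate_purificate_parallel_dataset.csv", newline="", encoding="utf-8")`, after adjusting the column names to match the actual header.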

Additional Information