/level2_dataannotation_nlp-level2-nlp-06

Data Annotation Project @ boostcamp AI Tech, NAVER Connect Foundation, Fall 2022

๐Ÿ˜ HAPPY Entertainment Dataset ๐Ÿ˜

๋„ค์ด๋ฒ„ ๋ถ€์ŠคํŠธ์บ ํ”„ NLP 6์กฐ HAPPYํŒ€์ด ์ œ์ž‘ํ•œ ๊ตญ๋‚ด ์˜ˆ๋Šฅ domain-specific dataset์ž…๋‹ˆ๋‹ค.
์ด ๋ฐ์ดํ„ฐ์…‹์€ Relation Extraction Task๋ฅผ ์œ„ํ•ด ์ œ์ž‘๋˜์—ˆ์œผ๋ฉฐ ํ•œ๊ตญ์–ด ์œ„ํ‚ค๋ฐฑ๊ณผ๋ฅผ ํฌ๋กค๋งํ–ˆ์Šต๋‹ˆ๋‹ค.

โ— DISCLAIMER โ— ๋ณธ Dataset์€ ํ˜์˜ค ๋ฐ ์ฐจ๋ณ„์  ํ‘œํ˜„์„ ํฌํ•จํ•˜๊ณ  ์žˆ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์—ด๋žŒ์— ์ฃผ์˜ํ•˜์„ธ์š”

Dataset Size

full dataset : 1663
train dataset : 1332
valid dataset : 167
test dataset : 168

📕 Dataset Description

Label Distribution

[Figure: label distribution]

Length Distribution

[Figure: sentence length distribution]

📘 Relation Map & Guideline & Example Sentences

Guideline: 6조(HAPPY)-예능 가이드라인.pdf

Relation map:

| class_name(ko) | class_name(eng) | direction(sub, obj) | description |
| --- | --- | --- | --- |
| 관계_없음 | no_relation | (,) | The relation cannot be inferred, or it does not fit any defined relation |
| 단체:별칭 | org:alternate_name | (ORG, ORG) | object is an alias of subject |
| 단체:소속인 | org:employee | (ORG, PER) | object is a person who works for subject |
| 단체:프로그램 | org:program | (ORG, PRO) | object is a program of subject |
| 인물:별칭 | per:alternate_name | (PER, PER/POH) | object is an alias of subject |
| 인물:동료 | per:colleagues | (PER, PER) | object is a person whose shared affiliation with subject is stated in the sentence |
| 인물:사건 | per:event | (PER, POH) | object is an event in which subject is involved |
| 인물:소속단체 | per:member_of | (PER, ORG) | object is an organization that subject belonged to or belongs to |
| 인물:참여프로그램 | per:participate_in | (PER, PRO) | object is a program that subject will participate in, participates in, or participated in |
| 인물:직업/직함 | per:title | (PER, POH) | object is a past or current occupation or title of subject |
| 프로:방송시간 | pro:air_time | (PRO, DAT) | object is the broadcast start time, end time, or duration of subject |
| 프로:종영일 | pro:end_at | (PRO, DAT) | object is the date subject ended |
| 프로:방영시작일 | pro:start_at | (PRO, DAT) | object is the date subject began airing |
| 프로:하위_프로 | pro:subprogram | (PRO, PRO/POH) | object is a segment, episode, or season within subject |
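The direction column of the relation map can double as a sanity check on annotations. The following is a minimal sketch, not part of the released dataset tooling: the names `RELATION_TYPES` and `is_valid_annotation` are hypothetical, and it assumes that a direction such as PER/POH means the object may be either type and that no_relation carries no type constraint.

```python
# Allowed (subject_type, object_type) pairs per relation,
# transcribed from the relation map above.
# Assumption: no_relation has no type constraint, so it maps to None.
RELATION_TYPES = {
    "no_relation": None,
    "org:alternate_name": {("ORG", "ORG")},
    "org:employee": {("ORG", "PER")},
    "org:program": {("ORG", "PRO")},
    "per:alternate_name": {("PER", "PER"), ("PER", "POH")},
    "per:colleagues": {("PER", "PER")},
    "per:event": {("PER", "POH")},
    "per:member_of": {("PER", "ORG")},
    "per:participate_in": {("PER", "PRO")},
    "per:title": {("PER", "POH")},
    "pro:air_time": {("PRO", "DAT")},
    "pro:end_at": {("PRO", "DAT")},
    "pro:start_at": {("PRO", "DAT")},
    "pro:subprogram": {("PRO", "PRO"), ("PRO", "POH")},
}


def is_valid_annotation(label, subject_type, object_type):
    """Return True if the entity types are allowed for the given label."""
    if label not in RELATION_TYPES:
        return False  # unknown label
    allowed = RELATION_TYPES[label]
    return allowed is None or (subject_type, object_type) in allowed


print(is_valid_annotation("org:employee", "ORG", "PER"))  # direction matches the map
print(is_valid_annotation("org:employee", "PER", "ORG"))  # reversed direction
```

A check like this mainly catches swapped subject/object pairs, for example confusing org:employee (ORG, PER) with per:member_of (PER, ORG).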

Example Sentences

org:alternate_name: Subsequently, CBS (Christian Broadcasting System), the first private broadcaster in Korea, opened on December 15, 1954, and Munhwa Broadcasting Corporation (MBC), the first commercial broadcaster in Busan, opened in December 1961.

org:employee: Yoo Jae-suk got to know Shim Hyung-rae through film shoots, and as a member of the 7th class of KBS's openly recruited comedians, he also maintains close ties with the members of that 7th class, who are called the flower of KBS's openly recruited comedians.

org:program: 《세계테마기행》 is a travel-focused educational program that airs on EBS 1TV in South Korea every Monday through Friday at 8:40 p.m.

per:alternate_name: Guests who made special appearances on Infinite Challenge (무한도전) during this period include actor Cha Seung-won and Maria Sharapova, the "fairy of world tennis." In the final episode of season 1, the challenge was "putting on lipstick on an amusement ride," and the group Sugar appeared as guests.

per:colleagues: The fifth free-travel episode was a special featuring acclaimed supporting actors, with Kang Ho-dong and Sung Dong-il serving as team leaders.

per:event: This eventually led to plagiarism allegations against Primary, and Caro Emerald's side also stated, "It is plagiarism."

per:member_of: After that, Country Kko Kko became a huge hit and Shin Jung-hwan and Tak Jae-hoon gained enormous fame. Tak Jae-hoon used that fame to debut as a film actor and shot several films with Kim Soo-mi.

per:participate_in: Because of this, Tak Jae-hoon struggled to make a living, and to get by he took a bit part on 경찰청 사람들, playing a criminal (a thief).

per:title: Yoo Jae-suk (劉在錫, Yu Jae-Seok, born August 14, 1972) is a South Korean broadcaster, MC, and comedian.

pro:air_time: 《TV 동물농장》 is an animal-focused educational program that airs every Sunday at 9:30 a.m.

pro:end_at: Infinite Challenge (무한도전) is a television program that aired on MBC TV from April 23, 2005 to March 31, 2018.

pro:start_at: 《개그콘서트》 is a comedy program that aired from September 4, 1999 to June 26, 2020.

pro:subprogram: The claim was that the part showing fear of riding up in the basket closely resembled '불가능은 없다', a segment of 일요일 일요일 밤에, a program on the same broadcaster.

Train Scores

| Pretrained Model | Micro F1 | AUPRC | Accuracy |
| --- | --- | --- | --- |
| klue/bert-base | 87.649 | 90.582 | 0.8468 |
| klue/roberta-small | 81.633 | 81.783 | 0.8138 |
| klue/roberta-base | 83.333 | 90.080 | 0.8288 |
| klue/roberta-large | 88.095 | 88.708 | 0.8559 |
| monologg/koelectra-small-v3-discriminator | 40.976 | 36.338 | 0.5045 |
| monologg/koelectra-base-v3-discriminator | 69.959 | 69.417 | 0.7087 |
| jinmang2/kpfbert | 85.600 | 84.315 | 0.8468 |
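This README does not state how the scores were computed. The sketch below shows one plausible reading, assuming the KLUE-RE convention in which micro F1 is reported on a 0-100 scale over every class except no_relation, while accuracy is a plain 0-1 fraction over all classes. The function names and the toy labels are illustrative, not taken from the project code.

```python
def micro_f1(y_true, y_pred, ignore_label="no_relation"):
    """Micro-averaged F1 (0-100) over all classes except `ignore_label`."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p != ignore_label:
            if p == t:
                tp += 1  # correct prediction of a counted class
            else:
                fp += 1  # a counted class was predicted wrongly
        if t != ignore_label and p != t:
            fn += 1      # a counted gold class was missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of exact label matches over all classes."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels for illustration only (not drawn from the dataset).
gold = ["per:title", "no_relation", "org:program", "per:title"]
pred = ["per:title", "per:title", "no_relation", "per:title"]
print(round(micro_f1(gold, pred), 3), accuracy(gold, pred))
```

Under this reading the two columns can legitimately rank models differently, because accuracy counts correct no_relation predictions and micro F1 does not.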

👨 Participants

PM : 류재환

김준휘, 박수현, 박승현, 설유민

📗 Wrap-up Report

NLP_6조_데이터_제작_랩업_리포트.pdf

📑 License

Original text source : https://ko.wikipedia.org/wiki/