/mrc-level2-nlp-13

KLUE-MRC Task, Team CLUE

Primary language: Python. License: Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0).

KLUE Machine Reading Comprehension

1. Project Abstract

✋ The task: using the KLUE MRC (Machine Reading Comprehension) dataset, retrieve documents relevant to a given question and extract the answer from them.

✋ The goal is to build and experiment with a model that finds the correct answer to a given question: a Retriever fetches the Top-k documents from Wikipedia, and a Reader extracts the answer from within those documents.

✋ Each team was limited to 10 submissions per day.

2. Installation

👉 Download the dataset

# data (51.2 MB)
tar -xzf data.tar.gz

👉 Download this repository

git clone https://github.com/boostcampaitech2/mrc-level2-nlp-13.git

👉 Manage package versions with Poetry

# install curl
apt-get install curl #7.58.0

# install poetry
curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python

# enable poetry tab completion
# Edit ~/.bashrc so that poetry can be used from the shell, then attach it to your virtual environment
poetry env use [`python path` of the virtual environment you use | `python` if the virtual environment is already active]

# after downloading the repo, install the pinned versions (as specified in pyproject.toml)
poetry install

3. 🏗️ Project Structure

3-1. Repository structure

mrc-level2-nlp-13
├── configs
│   └── example.json
├── model
│   ├── Reader
│   │   ├── RobertaCnn.py
│   │   └── trainer_qa.py
│   └── Retrieval
│       └── retrieval.py
├── inference.py
├── notebook
│   └── post_preprocessing.ipynb
├── ensemble
│   └── hard_vote.ipynb
├── augmentation
│   └── quesiton_generate.py
├── images
│   └── dataset.png
├── poetry.lock
├── pyproject.toml
├── readme.md
├── License.md
├── dense_retrieval_train.py
├── train_reader.py
└── utils
    ├── arguments.py
    ├── dense_utils
    │   ├── retrieval_dataset.py
    │   └── utils.py
    ├── logger.py
    └── utils_qa.py

3-2. Data structure

The figure below shows the distribution of the provided dataset.

Data distribution (figure)

For convenience, the dataset is stored in pyarrow format using the datasets library provided by Huggingface. It is organized as follows.

./data/                        # all data
    ./train_dataset/           # dataset used for training; consists of train and validation
    ./test_dataset/            # dataset used for submission; consists of validation
    ./wikipedia_documents.json # collection of Wikipedia documents; the corpus used for retrieval

If you use a dataset produced by data augmentation, add it to this directory and change "data_args" in the config accordingly.
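
As a minimal sketch of that config change, the snippet below copies an existing config and points its "data_args" section at an augmented dataset. The helper name `point_config_at_dataset`, the key `dataset_name`, and the paths in the usage note are assumptions for illustration; check configs/example.json for the keys this repository actually uses.

```python
import json
from pathlib import Path


def point_config_at_dataset(src_config, dst_config, dataset_dir):
    """Copy a JSON config, replacing the dataset path under "data_args".

    NOTE: "dataset_name" is a hypothetical key used for illustration only.
    """
    config = json.loads(Path(src_config).read_text(encoding="utf-8"))
    # Create the section if the source config does not have one yet.
    data_args = config.setdefault("data_args", {})
    data_args["dataset_name"] = str(dataset_dir)
    Path(dst_config).write_text(
        json.dumps(config, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return config
```

A call such as `point_config_at_dataset("configs/example.json", "configs/aug_exp.json", "./data/train_dataset_aug")` would leave the original config untouched and write a new one, which keeps each experiment's settings in its own file.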

4. train, evaluation, inference

4-1. 🚆 train

RoBERTa models do not use token type ids, so when you use one you must change the tokenizer option in the function below. The baseline runs on klue/bert-base, so the option is commented out; uncomment it (return_token_type_ids=False) when switching to a RoBERTa model. The tokenizer is called to preprocess the train and validation data (train_reader.py) and the test data (inference.py).

  • Experiments are run by creating a .json file under the configs directory that holds the parameters needed for training.
  • Trained models are saved as .bin files in the tuned_models/"model_name" directory.
# train_reader.py
def prepare_train_features(examples):
    # Tokenize with truncation and padding (only when the input is short),
    # using a stride to keep the overflowing tokens.
    # Each example therefore overlaps slightly with the previous context.
    tokenized_examples = tokenizer(
        ... ...
        #return_token_type_ids=False,  # set to False for roberta models, True for bert models
        padding="max_length" if data_args.pad_to_max_length else False,
    )
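
One way to avoid commenting and uncommenting that line is to derive the option from the model name. The sketch below is an assumption, not code from this repository: `tokenizer_kwargs` is a hypothetical helper, the substring check on "roberta" is a simple heuristic, and `klue/roberta-large` is used only as an example name.

```python
def tokenizer_kwargs(model_name_or_path, pad_to_max_length=False):
    """Build keyword arguments for the tokenizer call in prepare_train_features.

    Hypothetical helper: RoBERTa-style models do not use token type ids,
    so the tokenizer is asked not to return them.
    """
    is_roberta = "roberta" in model_name_or_path.lower()
    return {
        "return_token_type_ids": not is_roberta,
        "padding": "max_length" if pad_to_max_length else False,
    }


print(tokenizer_kwargs("klue/roberta-large"))
print(tokenizer_kwargs("klue/bert-base", pad_to_max_length=True))
```

The returned dictionary could then be unpacked into the tokenizer call (`tokenizer(..., **tokenizer_kwargs(name))`), so the same preprocessing function serves both model families.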
# train_reader argparser
-c, --config_file_path : name of the json file containing the train config
-l, --log_file_path : name of the file to write training logs to
-n, --model_name : name of the directory where the model is saved
--do_train : flag to train the Reader model
--do_eval : flag to validate the Reader model
  • reader training example
python train_reader.py -c ./configs/exp1.json -l exp1.log -n experiments1 --do_train
  • dense retriever training example
python train_reader.py -c ./configs/dense_exp1.json -l dense_exp1.log -n dense_experiment1 --do_train
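
For readers who want to see how such flags fit together, here is a minimal argparse sketch that mirrors the options listed above. It is an illustration, not the repository's actual parser: the function name `build_train_parser`, the default log file name, and which options are marked required are all assumptions.

```python
import argparse


def build_train_parser():
    """Sketch of a parser mirroring the train_reader.py flags listed above."""
    parser = argparse.ArgumentParser(description="Reader training (illustrative)")
    parser.add_argument("-c", "--config_file_path", required=True,
                        help="json file containing the train config")
    parser.add_argument("-l", "--log_file_path", default="train.log",  # assumed default
                        help="file to write training logs to")
    parser.add_argument("-n", "--model_name", required=True,
                        help="directory name the model is saved under")
    parser.add_argument("--do_train", action="store_true",
                        help="train the Reader model")
    parser.add_argument("--do_eval", action="store_true",
                        help="validate the Reader model")
    return parser


# Parse the same arguments as the reader training example.
args = build_train_parser().parse_args(
    ["-c", "./configs/exp1.json", "-l", "exp1.log", "-n", "experiments1", "--do_train"]
)
print(args.config_file_path, args.model_name, args.do_train, args.do_eval)
```

Because `--do_train` and `--do_eval` are independent `store_true` flags, passing both in one command runs training and validation together, while omitting both leaves them False.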

4-2. 📜 eval

Evaluating (validating) the MRC model requires setting the --do_eval flag separately. You can also simply add --do_eval to the training example above to train and evaluate in the same run.

# evaluate the MRC model (uses train/validation)
python train_reader.py -c ./configs/exp1.json -l exp1.log -n experiments1 --do_train --do_eval

4-3. 🥕 inference

Once the retrieval and MRC models have been trained, you can run ODQA (open-domain question answering) with inference.py.

  • To submit the trained model's results on test_dataset, run inference only (--do_predict).

  • To see how the trained model performs on ODQA over train_dataset, run evaluation (--do_eval).

# run ODQA (uses test_dataset)
# if wandb is logged in, results are saved to wandb automatically; otherwise they are simply printed
# inference argparser
-c, --config_file_path : name of the json file containing the inference config
-l, --log_file_path : name of the file to write inference logs to
-n, --inference_name : name of the directory where inference results are saved
-m, --model_name_or_path : name of the model directory to use for inference
python inference.py -c infer1.json -l infer1.log -n infer1_result -m ./tuned_models/train_dataset/ --do_predict

4-4. How to submit

Running inference.py with --do_predict as in the example above creates a file named predictions.json at the --inference_name location. Submit that file.
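
Before submitting, a quick sanity check on the file can catch empty or malformed output. The sketch below assumes predictions.json is a JSON object mapping question ids to answer strings; that layout and the helper name `check_predictions` are assumptions, so adjust them if the actual file differs.

```python
import json


def check_predictions(path):
    """Return the number of predictions after basic sanity checks.

    Assumes (not confirmed by the README) that the file is a JSON object
    of the form {question_id: answer_text}.
    """
    with open(path, encoding="utf-8") as f:
        predictions = json.load(f)
    if not isinstance(predictions, dict):
        raise ValueError("expected a JSON object of id -> answer")
    # Collect ids whose answer is missing or only whitespace.
    empty = [qid for qid, answer in predictions.items() if not str(answer).strip()]
    if empty:
        raise ValueError(f"{len(empty)} empty answers, e.g. id {empty[0]}")
    return len(predictions)
```

Calling `check_predictions` on the generated file either returns the prediction count or raises with a message naming the first problem found.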

4-5. MRC model training results

The results of the MRC model on the public and private datasets are shown below.

  • Public leaderboard: 9th of 19 teams 🥈

  • Private leaderboard: 7th of 19 teams 🥈

5. Things to know

  1. In inference.py, training and saving the sparse embedding used for TF-IDF scoring does not take long, so the corresponding argument defaults to True. After a run, sparse_embedding.bin and tfidfv.bin are saved. If you modify any sparse retrieval code, be sure to delete both files and run again; otherwise the existing files are loaded.

  2. For the model, results are not saved to the same folder unless you add --overwrite_cache.

  3. Likewise, the ./predictions/ folder is not written to again unless you add --overwrite_output_dir.
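
The load-if-present behaviour described in item 1 can be sketched as follows. This is an illustration under assumptions, not the repository's code: it assumes the cached files are pickles, and `load_or_build` and `clear_cache` are hypothetical helper names.

```python
import os
import pickle


def load_or_build(path, build_fn):
    """Load a pickled object from `path`, or build and save it if missing."""
    if os.path.isfile(path):
        # A stale file is loaded as-is, which is why it must be deleted
        # after the code that produces it has changed.
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj


def clear_cache(paths):
    """Delete cached files so the next run rebuilds them."""
    for path in paths:
        if os.path.isfile(path):
            os.remove(path)
```

With helpers like these, `clear_cache(["sparse_embedding.bin", "tfidfv.bin"])` (paths adjusted to wherever the files are written) would force the next run to rebuild both artifacts.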

6. License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
