mrc-level2-nlp-07


Boostcamp Machine Reading Comprehension Competition

Table of contents

  1. Introduction
  2. Project Outline
  3. Solution
  4. How to Use

1. Introduction



🐢 TEAM : 조지KLUE니

🔅 Members

김보성 김지후 김혜수 박이삭 이다곤 전미원 정두해

🔅 Contribution

김보성  Modeling • Reference Searching • Paper Implementation • Ensemble • GitHub Management

김지후  FAISS • Reference Searching

김혜수  Reference Searching • Elasticsearch Config & Optimization • Data Processing • Sparse/Dense Retrieval

박이삭  Reference Searching • GitHub Management

이다곤  Data Processing • Generative MRC

전미원  Data Preprocessing • Adding Elasticsearch to the Baseline • Re-ranking MRC Outputs w/ Retrieval • Ensemble

정두해  Data Exploration • Baseline Abstraction • Sparse/Dense Retriever • Reader Model Searching • Data Augmentation • MRC Hyperparameter Tuning • Pre/Postprocessing


2. Project Outline

  • Task : Build an ODQA model for extractive MRC
  • Date : 2021.10.12 - 2021.11.04 (4 weeks)
  • Description : The model we built for this ODQA competition is a two-stage pipeline. The first stage is the "retriever", which finds documents relevant to a question; the second is the "reader", which reads the retrieved documents and finds or generates an appropriate answer. By building each stage and combining them properly, you end up with an ODQA system that can answer even difficult questions (a minimal sketch of this flow follows the list below).
  • Train : 3,952 examples
  • Validation : 240 examples
  • Test : 600 examples
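
The flow above, as a minimal Python sketch. The `retriever` and `reader` objects (and their `retrieve`/`extract` methods) are hypothetical stand-ins for this repository's modules, shown only to illustrate the two stages:

```python
# Minimal sketch of the two-stage ODQA flow; not this repo's actual classes.

def answer(question, retriever, reader, top_k=5):
    # Stage 1 (retriever): fetch the top-k passages relevant to the question.
    passages = retriever.retrieve(question, top_k=top_k)
    # Stage 2 (reader): extract a scored answer span from each passage.
    candidates = [reader.extract(question, p) for p in passages]
    # Keep the highest-scoring span across all retrieved passages.
    best = max(candidates, key=lambda c: c["score"])
    return best["text"]
```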

πŸ† Final Score



λŒ€νšŒ μ‚¬μ΄νŠΈ : [AI stage](https://stages.ai/competitions/77)

Hardware

AI stageμ—μ„œ μ œκ³΅ν•œ server, GPU

  • GPU: V100

3. Solution

KEY POINT

  • ODQA Task (Open Domain Question Answering) : a hybrid model combining a Retrieval model and a Reader model
  • Implemented the retriever by adopting the dense retriever from the DPR paper, trained with additional negative samples, and combining it with Elasticsearch (see the training sketch below this list)
  • Used GPT-2 to generate questions paired with the contexts in the wiki data and trained the retrieval dense encoder model on them
  • Lengthened passages through data augmentation and used them as training data
  • Used klue/roberta-large, pretrained on a large amount of Korean data, as the reader model

Checklist

  • EDA
  • Data Preprocessing (special character removal, getting answer spans' start positions with special character tokens)
  • Data Augmentation (back-translation, question generation)
  • Data Postprocessing
  • Experimental Logging (WandB)
  • Retrieval (dense: FAISS using simple dual encoders; sparse: TF-IDF, BM25, Elasticsearch; dense+sparse: using a linear combination of dense and sparse scores as the new ranking function)
  • Custom Model Architecture (RoBERTa with BiLSTM, RoBERTa with Autoencoder)
  • Re-ranker (combining the reader score with the retriever score via linear combination, inspired by BERTserini; see the sketch after this list)
  • Ensemble
  • Don't Stop Pretraining (additional MLM task, TAPT + DAPT)
  • K-fold cross-validation
  • Shortened inference time when using Elasticsearch
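
The re-ranker sketch referenced above: a BERTserini-style linear interpolation of retriever and reader scores. The weight `mu` is a hypothetical hyperparameter to tune on the validation set:

```python
def rerank(candidates, mu=0.5):
    # candidates: list of (answer_text, retriever_score, reader_score).
    # BERTserini-style interpolation: S = (1 - mu) * S_retriever + mu * S_reader.
    scored = [(text, (1 - mu) * r_score + mu * s_score)
              for text, r_score, s_score in candidates]
    return max(scored, key=lambda pair: pair[1])
```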

Experiments

| Tried Experiments | Pipeline | Performance Improvement |
| --- | --- | --- |
| TF-IDF | Retrieval | |
| Elasticsearch config setting | Retrieval | |
| Question Generation (using GPT-2) | Retrieval | |
| Hard negatives (using BM25 + Elasticsearch) | Retrieval | |
| DPR implementation | Retrieval | |
| Dense+Sparse | Retrieval | |
| RoBERTa with Bi-LSTM | Reader | |
| RoBERTa with Autoencoder | Reader | |
| Back-Translation | Reader | |
| Context Concat (hard negatives) | Reader | |
| Retrieval+Reader Re-Ranker | Inference | |

4. How to Use

Installation

λ‹€μŒκ³Ό 같은 λͺ…λ Ήμ–΄λ‘œ ν•„μš”ν•œ librariesλ₯Ό λ‹€μš΄ λ°›μŠ΅λ‹ˆλ‹€.

pip install -r requirements.txt

Elasticsearch module (source: mentor 서중원's GitHub)

apt-get update && apt-get install -y gnupg2
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | apt-key add -
apt-get install apt-transport-https
echo "deb [https://artifacts.elastic.co/packages/7.x/apt](https://artifacts.elastic.co/packages/7.x/apt) stable main" | tee /etc/apt/sources.list.d/elastic-7.x.list
apt-get update && apt-get install elasticsearch
service elasticsearch start
cd /usr/share/elasticsearch
bin/elasticsearch-plugin install analysis-nori
service elasticsearch restart
pip install elasticsearch
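
Once the service is running with the nori plugin, an index can be created and queried from Python. Below is a minimal sketch using the official elasticsearch client; the index name `wiki` and field name `context` are illustrative, not necessarily the names used in this repo:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # defaults to localhost:9200

# Create an index whose text field is analyzed with the nori tokenizer
# installed above. Index/field names here are illustrative only.
body = {
    "settings": {"analysis": {"analyzer": {
        "korean": {"type": "custom", "tokenizer": "nori_tokenizer"}}}},
    "mappings": {"properties": {
        "context": {"type": "text", "analyzer": "korean"}}},
}
es.indices.create(index="wiki", body=body)

es.index(index="wiki", body={"context": "위키피디아 지문 예시"})
hits = es.search(index="wiki", body={"query": {"match": {"context": "예시"}}})
```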

BM25 module

pip install rank_bm25
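
A quick usage sketch of rank_bm25; the toy corpus stands in for the Wikipedia passages:

```python
from rank_bm25 import BM25Okapi

# Toy corpus standing in for the Wikipedia passages.
corpus = ["서울은 대한민국의 수도이다", "BM25는 확률 기반 랭킹 함수이다"]
bm25 = BM25Okapi([doc.split() for doc in corpus])

query = "대한민국의 수도".split()
scores = bm25.get_scores(query)            # one relevance score per passage
top_docs = bm25.get_top_n(query, corpus, n=1)
```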

Google deep_translator module

pip install -U deep-translator
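
This package backs the back-translation augmentation listed earlier. A minimal round-trip sketch (Korean to English and back); the sample sentence is illustrative:

```python
from deep_translator import GoogleTranslator

# Round-trip back-translation of one passage for data augmentation.
text = "이 지문은 데이터 증강을 위한 예시입니다."
english = GoogleTranslator(source="ko", target="en").translate(text)
augmented = GoogleTranslator(source="en", target="ko").translate(english)
```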

Dataset

Files: data/train_dataset/train, data/train_dataset/validation, data/test_dataset/validation

Data Analysis

Files: code/notebooks/ (folder)

Data Preprocessing

Files: preprocess.py, process_data.py, back_translation.py

Modeling

Files: train.py, inference.py, golden_retriever.py, golden_serini.py, inference_serini.py

Ensemble

Files: mixing_bowl.ipynb, mixing_bowl (1).ipynb

Directory

.
└── mrc-level2-nlp-07
     ├── code
     │    ├── outputs
     │    ├── dense_encoder
     │    └── retriever
     └── data
          ├── train_dataset
          │    ├── train
          │    └── validation
          ├── test_dataset
          │    └── validation
          └── wikipedia_passages.json
  • The code folder contains the libraries for data preprocessing, training, and inference.
  • Running train.py saves its outputs to the logs, results, and best_model folders.
  • After downloading the full code, users can specify argument options to run the individual library models.