| 김보성 | 김지후 | 김혜수 | 박이삭 | 이다곤 | 전미원 | 정두해 |
|---|---|---|---|---|---|---|
| Github | Github | Github | Github | Github | Github | Github |

| Member | Roles |
|---|---|
| 김보성 | Modeling • Reference searching • Paper implementation • Ensemble • GitHub management |
| 김지후 | FAISS • Reference searching |
| 김혜수 | Reference searching • Elasticsearch config & optimization • Data processing • Sparse/Dense retrieval |
| 박이삭 | Reference searching • GitHub management |
| 이다곤 | Data processing • Generative MRC |
| 전미원 | Data preprocessing • Adding Elasticsearch to the baseline • Re-ranking MRC outputs w/ retrieval • Ensemble |
| 정두해 | Data exploration • Baseline abstraction • Sparse/Dense retriever • Reader model search • Data augmentation • MRC hyperparameter tuning • Pre/postprocessing |
- Task: Build an ODQA model for extractive MRC
- Date: 2021.10.12 - 2021.11.04 (4 weeks)
- Description: The model we build in this ODQA competition consists of two stages. The first is the "retriever" stage, which finds documents relevant to the question; the next is the "reader" stage, which reads the retrieved documents and finds or generates an appropriate answer. Build each of the two stages and integrate them properly, and you will have built, with your own hands, an ODQA system that can answer even difficult questions. (A minimal sketch of this two-stage flow follows the details below.)
- Train: 3,952 examples
- Validation: 240 examples
- Test: 600 examples

Competition site: [AI stage](https://stages.ai/competitions/77)

Server and GPU provided by AI stage:

- GPU: V100
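To make the two-stage flow concrete, here is a minimal sketch; the `retriever`/`reader` objects and their methods are illustrative assumptions, not the repository's actual API.

```python
# A minimal sketch of the two-stage ODQA flow described above.
def answer(question, retriever, reader, top_k=10):
    # Stage 1: retrieve the passages most relevant to the question.
    passages = retriever.retrieve(question, top_k=top_k)
    # Stage 2: let the reader extract a candidate answer span from each passage.
    candidates = [reader.read(question, passage) for passage in passages]
    # Return the highest-scoring candidate answer.
    return max(candidates, key=lambda c: c.score)
```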
- ODQA task (Open-Domain Question Answering): a hybrid model combining a retriever and a reader
- Implemented the retriever model by adopting the dense retriever from the DPR paper, trained with additional negative samples, and combining it with Elasticsearch (a training sketch follows this list)
- Trained the retrieval dense encoder model on queries generated with GPT-2, paired with contexts from the wiki data
- Lengthened questions through data augmentation and used them as training data
- Used klue/roberta-large, pretrained on a large amount of Korean data, as the reader model
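As a rough illustration of the DPR-style dense retriever mentioned above, here is a minimal sketch of dual-encoder training with in-batch negatives. The encoder checkpoint (`klue/bert-base`) and the function names are assumptions for illustration, not the repository's actual code.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Hypothetical choice of encoder checkpoint; the repo may use a different one.
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
q_encoder = AutoModel.from_pretrained("klue/bert-base")  # question encoder
p_encoder = AutoModel.from_pretrained("klue/bert-base")  # passage encoder

def encode(model, texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return model(**batch).last_hidden_state[:, 0]  # [CLS] embedding per text

def in_batch_loss(questions, passages):
    """DPR-style loss: passage i is the positive for question i;
    every other passage in the batch serves as a negative."""
    q = encode(q_encoder, questions)        # (B, H)
    p = encode(p_encoder, passages)         # (B, H)
    scores = q @ p.T                        # (B, B) dot-product similarities
    targets = torch.arange(scores.size(0))  # diagonal entries are positives
    return F.cross_entropy(scores, targets)
```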
- EDA
- Data preprocessing (special character removal, getting answer spans' start positions with special character tokens)
- Data augmentation (back translation, question generation)
- Data postprocessing
- Experiment logging (WandB)
- Retrieval
  - Dense: FAISS, using simple dual-encoders
  - Sparse: TF-IDF, BM25, Elasticsearch
  - Dense+Sparse: using a linear combination of dense and sparse scores as the new ranking function (see the sketch after this list)
- Custom model architecture (RoBERTa with BiLSTM, RoBERTa with Autoencoder)
- Re-ranker (combining the reader score with the retriever score via a linear combination, inspired by BERTserini)
- Ensemble
- Don't Stop Pretraining (additional MLM task, TAPT + DAPT)
- K-fold cross validation
- Shortened inference time when using Elasticsearch
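The linear score combination mentioned for both Dense+Sparse retrieval and the BERTserini-inspired re-ranker can be sketched as follows. The min-max normalization and the weight `mu` are illustrative assumptions; in BERTserini the combined score is (1 - mu)·S_retriever + mu·S_reader, with mu tuned on validation data.

```python
import numpy as np

def min_max(scores):
    # Normalize scores to [0, 1] so the two score scales are comparable.
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-8)

def combine(score_a, score_b, mu=0.5):
    """Linear combination used as the new ranking function.
    For retrieval: score_a = sparse, score_b = dense.
    For the re-ranker: score_a = retriever, score_b = reader."""
    return (1 - mu) * min_max(score_a) + mu * min_max(score_b)
```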
| Tried Experiments | Pipeline | Performance Improvement |
|---|---|---|
| TF-IDF | Retrieval | |
| Elasticsearch config setting | Retrieval | |
| Question generation (using GPT-2) | Retrieval | |
| Hard negatives (using BM25 + Elasticsearch) | Retrieval | |
| DPR implementation | Retrieval | |
| Dense+Sparse | Retrieval | |
| RoBERTa with Bi-LSTM | Reader | |
| RoBERTa with Autoencoder | Reader | |
| Back-translation | Reader | |
| Context concat (hard negatives) | Reader | |
| Retrieval+Reader re-ranker | Inference | |
Download the required libraries with the following command:

```bash
pip install -r requirements.txt
```
Elasticsearch module (source: our mentor's GitHub):

```bash
apt-get update && apt-get install -y gnupg2
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | apt-key add -
apt-get install apt-transport-https
echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | tee /etc/apt/sources.list.d/elastic-7.x.list
apt-get update && apt-get install elasticsearch
service elasticsearch start
cd /usr/share/elasticsearch
bin/elasticsearch-plugin install analysis-nori
service elasticsearch restart
pip install elasticsearch
```
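Once Elasticsearch is running with the nori analyzer installed, an index can be created and queried from Python. This is a minimal sketch: the index name `wiki`, the field name `text`, and the mapping are assumptions, not the project's actual configuration.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create an index whose `text` field is analyzed with the Korean nori tokenizer.
es.indices.create(
    index="wiki",  # hypothetical index name
    body={
        "settings": {
            "analysis": {
                "analyzer": {
                    "korean": {"type": "custom", "tokenizer": "nori_tokenizer"}
                }
            }
        },
        "mappings": {"properties": {"text": {"type": "text", "analyzer": "korean"}}},
    },
)

es.index(index="wiki", body={"text": "대한민국의 수도는 서울이다."}, refresh=True)
res = es.search(index="wiki", body={"query": {"match": {"text": "대한민국 수도"}}})
print(res["hits"]["hits"][0]["_source"]["text"])
```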
BM25 module:

```bash
pip install rank_bm25
```
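For reference, rank_bm25 scores a query against a tokenized corpus in a few lines; the toy corpus and the naive whitespace tokenization below are just for illustration.

```python
from rank_bm25 import BM25Okapi

corpus = ["서울은 대한민국의 수도이다", "BM25는 확률적 순위 함수이다"]
tokenized_corpus = [doc.split() for doc in corpus]  # naive whitespace tokenization

bm25 = BM25Okapi(tokenized_corpus)
query = "대한민국의 수도".split()
print(bm25.get_scores(query))              # one relevance score per document
print(bm25.get_top_n(query, corpus, n=1))  # the best-matching document
```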
Google deep_translator module:

```bash
pip install -U deep-translator
```
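The back-translation augmentation mentioned above can be done with deep_translator by pivoting through another language. The ko → en → ko round trip below is a sketch (it needs network access, and the pivot language is an assumption).

```python
from deep_translator import GoogleTranslator

def back_translate(text, pivot="en"):
    # Korean -> pivot language -> Korean to obtain a paraphrased question.
    pivoted = GoogleTranslator(source="ko", target=pivot).translate(text)
    return GoogleTranslator(source=pivot, target="ko").translate(pivoted)

print(back_translate("대한민국의 수도는 어디인가?"))
```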
Files: data/train_dataset/train, data/train_dataset/validation, data/test_dataset/validation
Files: code/notebooks/(folder)
Files: preprocess.py, process_data.py, back_translation.py
Files: train.py, inference.py, golden_retriever.py, golden_serini.py, inference_serini.py
Files: mixing_bowl.ipynb, mixing_bowl (1).ipynb
```
.
└── mrc-level2-nlp-07
    ├── code
    │   ├── outputs
    │   ├── dense_encoder
    │   └── retriever
    └── data
        ├── train_dataset
        │   ├── train
        │   └── validation
        ├── test_dataset
        │   └── validation
        └── wikipedia_passages.json
```
code

- Each file contains libraries for data preprocessing • train • inference, respectively.
- Running `train.py` saves the results into the `logs`, `results`, and `best_model` folders.
- After downloading the entire codebase, users can make use of the individual library models by specifying argument options.