📚 aiEducation

Study notes on ML/DL and NLP

🔎 Tips for Finding Papers

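paperswithcode lists SOTA papers organized by task.
✔ The "Most implemented papers" list shows which papers have the most public implementations, which is a useful proxy for how influential a paper is.
Search for the task on GitHub.
Searching for awesome [task name] makes it easy to find curated lists.
The number of stars or forks gives a rough indication of how important a paper is.
Papers presented at conferences with a high h5-index, such as the ACL-family venues (ACL, EMNLP, NAACL), show the latest trends.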

AI h5-index, NLP h5-index

🍃 How to Write a Paper

Overleaf

LaTeX is a document preparation system for writing conference and journal papers. Most papers are written, shared, and maintained in LaTeX. Overleaf is a popular service that makes it easy to manage and share LaTeX paper projects. Search Overleaf > Templates for the template of the conference you are submitting to, download it, and write your paper from it.

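Site : https://www.overleaf.com/project
Usage : 나동빈 (Dongbin Na) > "이공계열 학생을 위한 Latex 작성 방법 Feat. Overleaf" (a LaTeX writing guide for science and engineering students, featuring Overleaf)

▶ LaTeX symbol reference : https://jjycjnmath.tistory.com/117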

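⭐ Essential NLP Papers (in chronological order)

I recommend reading the papers in chronological order, because you need to understand earlier papers to understand current ones. For example, you need to know the MASS paper to understand the BART paper precisely. Well-known papers are also usually cited and compared against in later papers. A typical example: the BERT paper (NAACL 2019) compares itself with GPT (2018).
😀 Good PyTorch code for studying the models below : https://github.com/paul-hyun/transformer-evolution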

Word2Vec (ICLR 2013) : Paper Link
Seq2Seq (NIPS 2014) : Paper Link / Seq2Seq.pdf / colab
Bahdanau Attention (ICLR 2015) : Paper Link
Transformer (NIPS 2017) : Paper Link
- PyTorch tutorial by Harvard's NLP group
GPT (2018) : Paper Link
BERT (NAACL 2019) : Paper Link
GPT-2 (2019) : Paper Link
RoBERTa (arXiv 2019) : Paper Link
GPT-3 (NeurIPS 2020) : Paper Link
BART (ACL 2020) : Paper Link
- huggingface bart : https://huggingface.co/transformers/v2.11.0/model_doc/bart.html

✔ Summary

Model	Base model	Pretraining Tasks
ELMo	two-layer biLSTM	next token prediction
GPT	Transformer decoder	next token prediction
BERT	Transformer encoder	masked language model + next sentence prediction
ALBERT	same as BERT but lightweight	masked language model + sentence order prediction
GPT-2	Transformer decoder	next token prediction
RoBERTa	same as BERT	masked language model (dynamic masking)
T5	Transformer encoder + decoder	multi-task mixture of unsupervised and supervised tasks, each converted into a text-to-text format
GPT-3	Transformer decoder	next token prediction
BART	BERT encoder + GPT decoder	reconstruct text from a noised version
ELECTRA	same as BERT	replaced token detection
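
As a rough illustration of the two most common objectives in the table, the sketch below builds training pairs for next token prediction and for masked language modeling from a toy token list. The function names, the `[MASK]` string, and the masking rate are assumptions made for illustration; real implementations operate on subword IDs, and BERT additionally replaces some selected tokens with random or unchanged tokens.

```python
import random

MASK = "[MASK]"  # placeholder mask token (an assumption for this sketch)


def next_token_pairs(tokens):
    """Next token prediction (GPT-style): predict token i+1 from tokens 0..i."""
    return [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Masked language modeling (BERT-style), simplified.

    Each position is masked independently with probability mask_prob.
    Returns the corrupted sequence and a dict {position: original token}.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets


if __name__ == "__main__":
    sent = "the cat sat on the mat".split()
    for context, target in next_token_pairs(sent):
        print(context, "->", target)
    print(mask_tokens(sent, mask_prob=0.3))
```

Calling `mask_tokens` with a different `seed` on every pass over the data gives a different mask pattern each time, which is the idea behind the dynamic masking listed for RoBERTa.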

⭐ Recommended Papers on Text Summarization

Baseline model

BART is the most commonly used baseline model. However, PEGASUS is also widely used on the XSum dataset.

BART (ACL, 2020)
PEGASUS : Pre-training with Extracted Gap-sentences for Abstractive Summarization (ICML, 2020)

Abstractive Summarization

When reading papers, distinguish between short and long papers, because the size of their contributions differs considerably.

Abstractive Text Summarization using Sequence-to-Sequence RNNs and Beyond (CoNLL, 2016)
Text Summarization with Pretrained Encoders (EMNLP, 2019)
RefSum: Refactoring Neural Summarization (NAACL, 2021)
SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization (ACL short, 2021)
SummaReranker: A Multi-Task Mixture-of-Experts Re-ranking Framework for Abstractive Summarization (ACL, 2022)
BRIO: Bringing Order to Abstractive Summarization (ACL, 2022)

Extractive Summarization

Extractive Summarization as Text Matching (ACL, 2020)
GSum: General Framework for Guided Neural Summarization (NAACL, 2021)

Recommended Papers on Contrastive Learning

Contrastive learning was first introduced in computer vision, so reading the vision papers as well is recommended.

Computer Vision

A Simple Framework for Contrastive Learning of Visual Representations (ICML, 2020)
Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere (ICML, 2020) (The first to present the alignment and uniformity analysis, the key reasons why contrastive learning works well.)
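
The two properties named in the last paper can be computed directly from embeddings. The sketch below is a minimal pure-Python version, assuming L2-normalized vectors and the commonly used settings alpha = 2 and t = 2; the function names are my own and it is not the authors' reference code. Alignment is the average distance between positive pairs, and uniformity is the log of the average Gaussian potential over all pairs. Lower is better for both.

```python
import math
from itertools import combinations


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def alignment(anchors, positives, alpha=2):
    """Mean distance^alpha between each anchor and its positive."""
    dists = [sq_dist(a, p) ** (alpha / 2) for a, p in zip(anchors, positives)]
    return sum(dists) / len(dists)


def uniformity(embeddings, t=2):
    """Log of the mean Gaussian potential over all distinct pairs."""
    pots = [math.exp(-t * sq_dist(u, v)) for u, v in combinations(embeddings, 2)]
    return math.log(sum(pots) / len(pots))


if __name__ == "__main__":
    x = [l2_normalize(v) for v in [[1, 0], [0, 1], [-1, 0], [0, -1]]]
    y = [l2_normalize(v) for v in [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1], [0.1, -0.9]]]
    print("alignment :", alignment(x, y))
    print("uniformity:", uniformity(x))
```

Embeddings that all collapse to one point give perfect alignment but the worst possible uniformity (0), which is why the two quantities are read together.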

NLP

SimCSE: Simple Contrastive Learning of Sentence Embeddings (EMNLP, 2021)
Debiased Contrastive Learning of Unsupervised Sentence Representations (ACL, 2022)
A Contrastive Framework for Learning Sentence Representations from Pairwise and Triple-wise Perspective in Angular Space (ACL, 2022)
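
The sentence-embedding papers above build on a contrastive objective with in-batch negatives. The following is a minimal pure-Python sketch of such an InfoNCE-style loss, assuming cosine similarity and an illustrative temperature of 0.05; the function names are hypothetical, and real training code would use batched tensor operations instead of Python loops.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def info_nce(anchors, positives, temperature=0.05):
    """Average InfoNCE loss over a batch.

    positives[i] is the positive for anchors[i]; every other positives[j]
    in the batch acts as a negative for anchors[i].
    """
    losses = []
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_denom - logits[i])  # cross-entropy with target i
    return sum(losses) / len(losses)


if __name__ == "__main__":
    anchors = [[1.0, 0.0], [0.0, 1.0]]
    good = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its anchor
    bad = [[0.1, 0.9], [0.9, 0.1]]    # positives swapped
    print("matched pairs :", info_nce(anchors, good))
    print("swapped pairs :", info_nce(anchors, bad))
```

The loss is small when each anchor is most similar to its own positive and large when another item in the batch is more similar.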

📊 Evaluation Metrics

BLEU

Bilingual Evaluation Understudy
Used to measure the quality of machine translation
Measures translation quality by comparing how similar the machine translation output is to a human translation
Higher is better
Pros : language-independent, fast to compute
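
To make the comparison concrete, here is a minimal sketch of sentence-level BLEU against a single reference: clipped n-gram precisions for n = 1..4, combined by a geometric mean and multiplied by a brevity penalty. It applies no smoothing, so any n-gram order without a match yields 0, and the function names are my own. For reported results an established implementation such as sacreBLEU or NLTK is the safer choice.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference, no smoothing."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on a mat".split()
    candidate = "the cat sat on the mat".split()
    print(bleu(candidate, reference))
```

Because only token n-grams are compared, the same code works for any language once the text is tokenized, which is the language-independence noted above.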

ROUGE / ROUGE 2.0

Recall-Oriented Understudy for Gisting Evaluation
github : https://github.com/bheinzerling/pyrouge
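Used to measure the quality of text summarization
The ROUGE score increases as more tokens overlap between the reference summary and the model-generated summary. However, it gives no credit to a sentence that expresses the same meaning with different words, and ROUGE 2.0 was proposed to address this limitation.
ROUGE 2.0 addresses the issue by taking synonyms and topic coverage into account. → ROUGE - {NN | Topic | TopicUniq} + Synonyms
It still cannot score summaries perfectly, but it is generally regarded as the best evaluation method available so far.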
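
The token-overlap idea, and the synonym limitation described above, can be seen in a minimal ROUGE-N sketch. This is a simplified illustration with hypothetical function names (no stemming, no ROUGE-L), not a replacement for pyrouge.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) of n-gram overlap between two token lists."""
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    overlap = sum((cand & ref).values())  # multiset intersection = clipped matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    reference = "the film was great".split()
    candidate = "the movie was excellent".split()
    # Same meaning, different words: only "the" and "was" overlap.
    print("ROUGE-1:", rouge_n(candidate, reference, n=1))
    print("ROUGE-2:", rouge_n(candidate, reference, n=2))
```

The candidate says the same thing as the reference, yet only half of its unigrams match and no bigram matches, so plain ROUGE underrates it.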

📬 Submission

Workshop
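- Large conferences hold one-day workshops just before or after the main conference. A workshop works like a small conference where main-conference attendees gather around a specific topic. It is also a place to share work that is not quite ready for the main conference, or work in progress, and to get feedback.
- The conference first issues a call for workshops and the committee accepts or rejects each proposal; each accepted workshop then issues its own call for papers.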

Tutorial
- Rather than proposing new research, a tutorial is a one-day session that gives an introductory lecture on a topic that has recently risen to prominence (e.g. ACL 2020 open-domain QA tutorial)

Main Conference
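- This is the most important part of the event. Authors of accepted papers present their methods in oral or poster sessions.
- ACL-family conferences (ACL, EMNLP, NAACL, EACL, COLING) accept submissions as either long or short papers. Each conference specifies what it expects of long and short papers, so checking the call for papers is recommended. In general, a short paper is considerably shorter than a long paper and is expected to make a smaller contribution.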
- NAACL call for papers 2022
  - Long paper : (8 pages) substantial, original, completed and unpublished work
  - Short paper : (4 pages) original and unpublished work

Dataset Download

Dataset	Domain	Train	Val	Test	Doc #Tokens	Sum #Tokens
XSum	News	204,045	11,332	11,334	437.21	23.87
CNN/DM	News	287,113	13,368	11,490	782.67	58.33

XSum

hugging face link : https://huggingface.co/datasets/xsum
get dataset

import datasets

dataset = datasets.load_dataset("xsum")

CNN/DM

hugging face link : https://huggingface.co/datasets/cnn_dailymail
get dataset

import datasets

dataset = datasets.load_dataset("cnn_dailymail", "3.0.0")  # dataset name (matches the hub URL above), version
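
As a quick sanity check against the token statistics in the table above, a helper like the one below can average document and summary lengths over a split. It is only a sketch: it counts whitespace-separated tokens, so the numbers will not exactly match the table, and the field names (`document`/`summary` for XSum, `article`/`highlights` for CNN/DM) reflect my understanding of the Hugging Face versions and should be verified against the loaded dataset's features.

```python
def average_lengths(examples, doc_key, sum_key):
    """Average whitespace-token counts of documents and summaries."""
    n = doc_total = sum_total = 0
    for ex in examples:
        doc_total += len(ex[doc_key].split())
        sum_total += len(ex[sum_key].split())
        n += 1
    if n == 0:
        raise ValueError("no examples given")
    return doc_total / n, sum_total / n


if __name__ == "__main__":
    # Toy stand-in for a dataset split; with the Hugging Face datasets loaded
    # above one would pass e.g. dataset["validation"] instead of this list.
    toy = [
        {"document": "a b c d e f", "summary": "a b"},
        {"document": "g h i j", "summary": "k l m n"},
    ]
    print(average_lengths(toy, "document", "summary"))
```

For CNN/DM the call would presumably be `average_lengths(dataset["validation"], "article", "highlights")`.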

🔬 Library

Spacy

link : https://spacy.io/usage/spacy-101#whats-spacy
Open-source Python library for natural language processing (the en_core_web_sm pipeline loaded here targets English).
Supported features : Tokenization, POS Tagging, Dependency Parsing, NER, Similarity ...

import spacy

# Load the small English pipeline (install it first: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
