Interesting NLP resources
- Crawling/mining ?
- Wrapper induction
- Normalization ?
- Tokenization
- Document alignment ?
- Gale–Church alignment algorithm (https://web.archive.org/web/20061026051708/http://acl.ldc.upenn.edu/J/J93/J93-1004.pdf , https://en.wikipedia.org/w/index.php?title=Gale%E2%80%93Church_alignment_algorithm&oldformat=true)
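Gale–Church aligns sentences across a bitext using only sentence lengths, via dynamic programming over "beads" (1-1, 1-0, 0-1, 2-1, 1-2 groupings). A minimal self-contained sketch; the variance parameter s2 = 6.8 and ratio c = 1 follow the paper, but the bead priors and the exact cost formula here are illustrative simplifications:

```python
import math

# Illustrative bead priors (assumed values, not the paper's tuned ones).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099, (2, 1): 0.089, (1, 2): 0.089}

def match_cost(len_src, len_tgt, c=1.0, s2=6.8):
    """-log probability that segments of these lengths are mutual translations."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / c) / 2.0
    delta = (len_tgt - len_src * c) / math.sqrt(max(mean * s2, 1e-9))
    # two-tailed probability of |delta| under a standard normal
    p = max(2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(delta) / math.sqrt(2)))), 1e-12)
    return -math.log(p)

def align(src_lens, tgt_lens):
    """DP over sentence-length sequences; returns the best list of beads."""
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c_new = cost[i][j] - math.log(prior) + match_cost(
                    sum(src_lens[i:ni]), sum(tgt_lens[j:nj]))
                if c_new < cost[ni][nj]:
                    cost[ni][nj] = c_new
                    back[ni][nj] = (di, dj)
    beads, i, j = [], n, m
    while i or j:  # trace back from the end
        di, dj = back[i][j]
        beads.append((di, dj))
        i, j = i - di, j - dj
    return beads[::-1]
```

On real bitexts the lengths would be character counts per sentence; the DP recovers merges and deletions as non-1-1 beads.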
- Deep Contextualized Word Representations (Peters et al., NAACL 2018) - ELMo
- BERT: Pre-training of deep bidirectional transformers for language understanding (Devlin et al., NAACL-HLT 2019) - BERT
- Efficient estimation of word representations in vector space (Mikolov et al., ICLR 2013 ?) - word2vec
- CoVe
- GloVe
- FastText?
- Stanford Question Answering Dataset (SQuAD) for QA - Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.
- Stanford Natural Language Inference (SNLI) corpus - Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
- OntoNotes benchmark - Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In CoNLL.
- CoNLL 2003 NER task - Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL.
- Stanford Sentiment Treebank (SST-5) - Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.
- GLUE
- MTEB
- BIG BENCH
- SemCor 3.0
- Wall Street Journal portion of the Penn Treebank (PTB)
- Europarl
- Pile - Gao, Leo, et al. "The Pile: An 800GB dataset of diverse text for language modeling." arXiv preprint arXiv:2101.00027 (2020).
- ESIM sequence model - Qian Chen, Xiao-Dan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In ACL.
- Conditional random fields (CRFs) - John D. Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.
- Word Sense Disambiguation (WSD)
- Part-of-speech (POS) tagging; constituency parsing (syntactic parse trees); dependency parsing
- Sentiment analysis
- Coreference resolution
- Chunking
- Semantic Role Labeling (SRL)
- Named Entity Recognition (NER)
- Document classification
- Question-Answering (QA)
- Text summarization
- Entailment
- Spelling correction
- Auto-correction
- Information retrieval
- Machine Translation
- Information extraction ?
- Anaphora resolution ?
- BLEU (Bilingual Evaluation Understudy) score
- ROUGE metric
- Perplexity
- Transformer-based metrics, e.g., BERTScore
- Simple ones like F1 score, Accuracy, etc.
- METEOR
- Ranking-based like SQuAD
- Correlation-based, e.g., Matthews correlation coefficient (MCC), Pearson's r
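Two of the metrics above, sketched in code: sentence-level BLEU via NLTK (no corpus downloads needed for this module), and perplexity as the exponentiated mean negative log-likelihood. The token probabilities in the perplexity example are made-up numbers:

```python
import math
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# BLEU: n-gram overlap between a candidate and one or more references.
reference = [["the", "cat", "is", "on", "the", "mat"]]
candidate = ["the", "cat", "sat", "on", "the", "mat"]
smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
bleu = sentence_bleu(reference, candidate, smoothing_function=smooth)

# Perplexity: exp of the average negative log-probability per token.
token_probs = [0.2, 0.1, 0.05, 0.3]  # assumed model outputs, for illustration
ppl = math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))
```

Lower perplexity means the model is less "surprised" by the text; BLEU ranges over (0, 1] (often reported scaled to 100).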
- Genomic sequences (DNA-based ?)
- Speech and Language Processing, Dan Jurafsky and James H. Martin (https://web.stanford.edu/~jurafsky/slp3/)
- Common decoding schemes: https://medium.com/nlplanet/two-minutes-nlp-most-used-decoding-methods-for-language-models-9d44b2375612
- Coursera NLP course
- UMass CS685: Advanced Natural Language Processing (Spring 2023) (https://www.youtube.com/watch?v=EJ8H3Ak_afA&list=PLWnsVgP6CzaelCF_jmn5HrpOXzRAPNjWj&index=1)
- Georgia Tech CS 7650: Natural Language Processing (Spring 2023); Alan Ritter (https://aritter.github.io/CS-7650-sp23/)
- Stanford; CS224N: Natural Language Processing with Deep Learning; Chris Manning (https://web.stanford.edu/class/cs224n/index.html#coursework)
- CSE 5539: Cutting-Edge Topics in Natural Language Processing; Yu Su (https://ysu1989.github.io/courses/au20/cse5539/)
- NLTK (https://www.nltk.org/)
- OpenNMT (https://opennmt.net/)
- Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/)
- torchtext
- gensim (https://pypi.org/project/gensim/)
- https://github.com/OpenNLPLab
- STRAND (Resnik's system for mining parallel text from the web)
- Tatoeba Project (https://tatoeba.org/en ; http://www.manythings.org/anki/)
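A quick tokenization example with NLTK (listed above), using `TreebankWordTokenizer` since it follows Penn Treebank conventions and needs no downloaded model data (unlike `word_tokenize`, which requires the "punkt" resource):

```python
from nltk.tokenize import TreebankWordTokenizer

# Splits punctuation and English contractions ("Don't" -> "Do" + "n't").
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("Don't split contractions naively, okay?")
```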