Interesting NLP resources
- Crawling/mining ?
- Wrapper induction
- Normalization ?
- Tokenization
- Document alignment ?
- Gale–Church alignment algorithm (https://web.archive.org/web/20061026051708/http://acl.ldc.upenn.edu/J/J93/J93-1004.pdf , https://en.wikipedia.org/w/index.php?title=Gale%E2%80%93Church_alignment_algorithm&oldformat=true)
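Gale–Church aligns sentences across a bitext using only sentence lengths, via dynamic programming over "beads" (1-1, 1-0, 0-1, 2-1, 1-2 groupings). A minimal self-contained sketch; the variance parameter s2 = 6.8 and ratio c = 1 follow the paper, but the bead priors and the exact cost formula here are illustrative simplifications:

```python
import math

# Illustrative bead priors (assumed values, not the paper's tuned ones).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099, (2, 1): 0.089, (1, 2): 0.089}

def match_cost(len_src, len_tgt, c=1.0, s2=6.8):
    """-log probability that segments of these lengths are mutual translations."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / c) / 2.0
    delta = (len_tgt - len_src * c) / math.sqrt(max(mean * s2, 1e-9))
    # two-tailed probability of |delta| under a standard normal
    p = max(2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(delta) / math.sqrt(2)))), 1e-12)
    return -math.log(p)

def align(src_lens, tgt_lens):
    """DP over sentence-length sequences; returns the best list of beads."""
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c_new = cost[i][j] - math.log(prior) + match_cost(
                    sum(src_lens[i:ni]), sum(tgt_lens[j:nj]))
                if c_new < cost[ni][nj]:
                    cost[ni][nj] = c_new
                    back[ni][nj] = (di, dj)
    beads, i, j = [], n, m
    while i or j:  # trace back from the end
        di, dj = back[i][j]
        beads.append((di, dj))
        i, j = i - di, j - dj
    return beads[::-1]
```

On real bitexts the lengths would be character counts per sentence; the DP recovers merges and deletions as non-1-1 beads.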
- Deep Contextualized Word Representations (Peters et al., NAACL 2018) - ELMo
- BERT: Pre-training of deep bidirectional transformers for language understanding (Devlin et al., NAACL-HLT 2019) - BERT
- Efficient estimation of word representations in vector space (Mikolov et al., ICLR 2013 ?) - word2vec
- CoVe
- GloVe
- FastText?
- Stanford Question Answering Dataset (SQuAD) for QA - Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.
- Stanford Natural Language Inference (SNLI) corpus - Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
- OntoNotes benchmark - Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In CoNLL.
- CoNLL 2003 NER task - Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL.
- Stanford Sentiment Treebank (SST-5) - Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.
- GLUE
- MTEB
- BIG BENCH
- SemCor 3.0
- Wall Street Journal portion of the Penn Treebank (PTB)
- Europarl
- Pile - Gao, Leo, et al. "The Pile: An 800GB dataset of diverse text for language modeling." arXiv preprint arXiv:2101.00027 (2020).
- ESIM sequence model - Qian Chen, Xiao-Dan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In ACL.
- Conditional random fields (CRFs) - John D. Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.
- Word Sense Disambiguation (WSD)
- Part-of-speech (POS) tagging; constituency parsing (syntactic parse trees); dependency parsing
- Sentiment analysis
- Coreference resolution
- Chunking
- Semantic Role Labeling (SRL)
- Named Entity Recognition (NER)
- Document classification
- Question-Answering (QA)
- Text summarization
- Entailment
- Spelling correction
- Auto-correction
- Information retrieval
- Machine Translation
- Information extraction ?
- Anaphora resolution ?
- BLEU (Bilingual Evaluation Understudy) score
- ROUGE metric
- Perplexity
- Transformer-based metrics, e.g., BERTScore
- Simple ones like F1 score, Accuracy, etc.
- METEOR
- Ranking-based like SQuAD
- Correlation-based, e.g., Matthews correlation coefficient (MCC), Pearson's r
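Two of the metrics above, sketched in code: sentence-level BLEU via NLTK (no corpus downloads needed for this module), and perplexity as the exponentiated mean negative log-likelihood. The token probabilities in the perplexity example are made-up numbers:

```python
import math
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# BLEU: n-gram overlap between a candidate and one or more references.
reference = [["the", "cat", "is", "on", "the", "mat"]]
candidate = ["the", "cat", "sat", "on", "the", "mat"]
smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
bleu = sentence_bleu(reference, candidate, smoothing_function=smooth)

# Perplexity: exp of the average negative log-probability per token.
token_probs = [0.2, 0.1, 0.05, 0.3]  # assumed model outputs, for illustration
ppl = math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))
```

Lower perplexity means the model is less "surprised" by the text; BLEU ranges over (0, 1] (often reported scaled to 100).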
- Genomic sequences (DNA-based ?)
- Speech and Language Processing, Dan Jurafsky and James H. Martin (https://web.stanford.edu/~jurafsky/slp3/)
- Common decoding schemes: https://medium.com/nlplanet/two-minutes-nlp-most-used-decoding-methods-for-language-models-9d44b2375612
- Coursera NLP course
- UMass CS685: Advanced Natural Language Processing (Spring 2023) (https://www.youtube.com/watch?v=EJ8H3Ak_afA&list=PLWnsVgP6CzaelCF_jmn5HrpOXzRAPNjWj&index=1)
- Georgia Tech CS 7650: Natural Language Processing (Spring 2023); Alan Ritter (https://aritter.github.io/CS-7650-sp23/)
- Stanford; CS224N: Natural Language Processing with Deep Learning; Chris Manning (https://web.stanford.edu/class/cs224n/index.html#coursework)
- CSE 5539: Cutting-Edge Topics in Natural Language Processing; Yu Su (https://ysu1989.github.io/courses/au20/cse5539/)
- NLTK (https://www.nltk.org/)
- OpenNMT (https://opennmt.net/)
- Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/)
- torchtext
- gensim (https://pypi.org/project/gensim/)
- https://github.com/OpenNLPLab
- STRAND (Resnik's system for mining parallel text from the web)
- Tatoeba Project (https://tatoeba.org/en ; http://www.manythings.org/anki/)
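A quick tokenization example with NLTK (listed above), using `TreebankWordTokenizer` since it follows Penn Treebank conventions and needs no downloaded model data (unlike `word_tokenize`, which requires the "punkt" resource):

```python
from nltk.tokenize import TreebankWordTokenizer

# Splits punctuation and English contractions ("Don't" -> "Do" + "n't").
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("Don't split contractions naively, okay?")
```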