
Interesting NLP resources

Obtain data (Data mining)

  1. Crawling/mining ?
  2. Wrapper induction


  1. Normalization ?
  2. Tokenization
  3. Document alignment ?
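
As a rough illustration of the first two steps above, here is a minimal sketch using only the Python standard library. The function names `normalize` and `tokenize` are hypothetical, and the choices made (NFKC normalization, lowercasing, a simple regex split) are assumptions for the example; real pipelines often use language-specific or subword tokenizers instead.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list[str]:
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


if __name__ == "__main__":
    print(tokenize(normalize("Hello,   World!")))
    # ['hello', ',', 'world', '!']
```

Document alignment is not covered by this sketch.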


  1. Gale–Church alignment algorithm (https://web.archive.org/web/20061026051708/http://acl.ldc.upenn.edu/J/J93/J93-1004.pdf , https://en.wikipedia.org/w/index.php?title=Gale%E2%80%93Church_alignment_algorithm&oldformat=true)


  1. Deep Contextualized Word Representations (Peters et al., NAACL 2018) - ELMo
  2. BERT: Pre-training of deep bidirectional transformers for language understanding (Devlin et al., NAACL-HLT 2019) - BERT
  3. Efficient estimation of word representations in vector space (Mikolov et al., ICLR Workshop 2013) - word2vec
  4. CoVe
  5. GloVe
  6. FastText?


Benchmark datasets

  1. Stanford Question Answering Dataset (SQuAD) for QA - Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.
  2. Stanford Natural Language Inference (SNLI) corpus - Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
  3. OntoNotes benchmark - Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In CoNLL.
  4. CoNLL 2003 NER task - Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL.
  5. Stanford Sentiment Treebank (SST-5) - Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.
  6. GLUE
  7. MTEB


  1. SemCor 3.0
  2. Wall Street Journal portion of the Penn Treebank (PTB)
  3. Europarl
  4. Pile - Gao, Leo, et al. "The Pile: An 800GB dataset of diverse text for language modeling." arXiv preprint arXiv:2101.00027 (2020).


  1. ESIM sequence model - Qian Chen, Xiao-Dan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In ACL.


  1. John D. Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.


  1. Word Sense Disambiguation (WSD)
  2. Part-of-speech (POS) tagging, syntactic parse tree (Constituency parsing?), dependency grammar (parsing)
  3. Sentiment analysis
  4. Co-reference Resolution
  5. Chunking
  6. Semantic Role Labeling (SRL)
  7. Named Entity Recognition (NER)
  8. Document classification
  9. Question-Answering (QA)
  10. Text summarization
  11. Entailment
  12. Spelling correction
  13. Auto-correction
  14. Information retrieval
  15. Machine Translation
  16. Information extraction ?
  17. Anaphora resolution ?


  1. BLEU (Bilingual Evaluation Understudy) score
  2. ROUGE metric
  3. Perplexity
  4. Transformer-based ? like BERTScore
  5. Simple ones like F1 score, Accuracy, etc.
  6. Ranking-based like SQuAD
  7. Correlation-based like MCC, Pearson
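
To make a few of the metrics above concrete, here is a minimal sketch in plain Python. The function names are hypothetical. `clipped_unigram_precision` computes only the clipped 1-gram precision that BLEU builds on (full BLEU also combines higher-order n-grams and a brevity penalty), and `perplexity` assumes the model's per-token probabilities are already available.

```python
import math
from collections import Counter


def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)


def clipped_unigram_precision(candidate, reference):
    """Fraction of candidate tokens matched in the reference, with each
    token's count clipped to its count in the reference (as in BLEU)."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    overlap = sum(min(c, ref_counts[t]) for t, c in cand_counts.items())
    return overlap / len(candidate)


def f1_score(gold, pred, positive=1):
    """Binary F1 for the given positive label."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
    print(clipped_unigram_precision("the the the the".split(), "the cat".split()))
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

A model that assigns probability 0.25 to every token has perplexity 4, matching the intuition of choosing uniformly among four options at each step.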

Applications ?

  1. Genomic sequences (DNA-based ?)


  1. Speech and Language Processing, Dan Jurafsky and James H. Martin (https://web.stanford.edu/~jurafsky/slp3/)

Blog articles

  1. Common decoding schemes: https://medium.com/nlplanet/two-minutes-nlp-most-used-decoding-methods-for-language-models-9d44b2375612


  1. NLP coursera course
  2. UMass CS685: Advanced Natural Language Processing (Spring 2023) (https://www.youtube.com/watch?v=EJ8H3Ak_afA&list=PLWnsVgP6CzaelCF_jmn5HrpOXzRAPNjWj&index=1)
  3. Georgia Tech CS 7650: Natural Language Processing (Spring 2023); Alan Ritter (https://aritter.github.io/CS-7650-sp23/)
  4. Stanford; CS224N: Natural Language Processing with Deep Learning; Chris Manning (https://web.stanford.edu/class/cs224n/index.html#coursework)
  5. CSE 5539: Cutting-Edge Topics in Natural Language Processing; Yu Su (https://ysu1989.github.io/courses/au20/cse5539/)


  1. NLTK (https://www.nltk.org/)
  2. OpenNMT (https://opennmt.net/)
  3. Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/)
  4. torchtext
  5. gensim (https://pypi.org/project/gensim/)
  6. https://github.com/OpenNLPLab


  1. STRAND by Resnik
  2. Tatoeba Project (https://tatoeba.org/en ; http://www.manythings.org/anki/)
