Indonesian NLP Resources

This is a list of tutorials, workshops, talks, books, papers, and other resources on computational linguistics research for Indonesian languages. The list will be updated over time. You are welcome to send a pull request to update the list and become one of the contributors! 🚀

📌 If you are working on anything related to Indonesian or any of the local languages of Indonesia, don't hesitate to contact me or send a pull request!

📔 Books

Jan Wira Gotama Putra (2019) Pengenalan Konsep Pembelajaran Mesin dan Deep Learning (in Indonesian). [Book]

🔉 Talks

Bedah Paper Series by INACL (in Indonesian) [Video]

📑 Research Papers

Position / Survey

Aji, et al. (2022) One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia. ACL [Paper]

Datasets and Pretrained Models

Public Benchmark

Winata, et al. (2022) NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages. Preprint [Paper] [Benchmark]
Cahyawijaya, et al. (2021) IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation. EMNLP [Paper] [Benchmark] [Huggingface Models]
Wibowo, et al. (2021) IndoCollex: A Testbed for Morphological Transformation of Indonesian Colloquial Words. ACL Findings [Paper] [Benchmark]
Koto, et al. (2020) IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP. COLING [Paper] [Benchmark]
Fajri Koto and Ikhwan Koto (2020) Towards Computational Linguistics in Minangkabau Language: Studies on Sentiment Analysis and Machine Translation. PACLIC [Paper] [Benchmark]
Wilie, et al. (2020) IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding. AACL [Paper] [Benchmark] [Huggingface Models]

Language-Specific Model

Wongso, et al. (2022) Pre-Trained Transformer-Based Language Models for Sundanese. Journal of Big Data [Paper]

Morphology Analysis

Pimentel, et al. (2021) SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages. Workshop on Computational Research in Phonetics, Phonology, and Morphology [Paper] [Dataset]

POS Tagging

Devin Hoesen and Ayu Purwarianti (2018) Investigating Bi-LSTM and CRF with POS Tag Embedding for Indonesian Named Entity Tagger. International Conference on Asian Language Processing [Paper] [Benchmark]
Dinakaramani, et al. (2014) Designing an Indonesian Part of Speech Tagset and Manually Tagged Indonesian Corpus. International Conference on Asian Language Processing [Paper] [Dataset]

Named Entity Recognition

Devin Hoesen and Ayu Purwarianti (2018) Investigating Bi-LSTM and CRF with POS Tag Embedding for Indonesian Named Entity Tagger. International Conference on Asian Language Processing [Paper] [Benchmark]
Muhammad Fachri (2014) Named Entity Recognition for Indonesian Text using Hidden Markov Model. Undergraduate Thesis [Paper] [Dataset]
Alfina, et al. (2016) DBpedia Entities Expansion in Automatically Building Dataset for Indonesian NER. International Conference on Advanced Computer Science and Information Systems [Paper] [Dataset]

Word Sense Disambiguation

Mahendra, et al. (2018) Cross-Lingual and Supervised Learning Approach for Indonesian Word Sense Disambiguation Task. Global Wordnet Conference [Paper] [Dataset]

Constituency Parsing

Arwidarasti, et al. (2019) Converting an Indonesian Constituency Treebank to the Penn Treebank Format. International Conference on Asian Language Processing [Paper] [Dataset]
Moeljadi, et al. (2018) Building Cendana: a Treebank for Informal Indonesian. Global Wordnet Conference [Paper] [Dataset]
David Moeljadi (2017) Building JATI: A Treebank for Indonesian. Global Wordnet Conference [Paper] [Dataset]

Dependency Parsing

Zeman, et al. (2018) CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. CoNLL Shared Task [Paper] [Dataset]
McDonald, et al. (2013) Universal Dependency Annotation for Multilingual Parsing. ACL [Paper] [Dataset]

Coreference Resolution

Artari, et al. (2021) A Multi-Pass Sieve Coreference Resolution for Indonesian. RANLP [Paper] [Dataset]

Chatbot

Lin, et al. (2021) XPersona: Evaluating Multilingual Personalized Chatbot. NLP4ConvAI [Paper] [Benchmark] [Dataset]

Question Answering

Clark, et al. (2020) TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages. TACL [Paper] [Dataset]
Purwarianti, et al. (2007) A Machine Learning Approach for Indonesian Question Answering System. RANLP [Paper] [Benchmark]

Summarization

Kemal Kurniawan and Samuel Louvan (2018) A New Benchmark Dataset for Indonesian Text Summarization. International Conference on Asian Language Processing [Paper] [Benchmark] [Dataset]
Koto, et al. (2020) A Large-scale Indonesian Dataset for Text Summarization. AACL [Paper] [Benchmark] [Dataset]

Keyphrase Extraction

Mahfuzh, et al. (2019) Improving Joint Layer RNN based Keyphrase Extraction by Using Syntactical Features. International Conference of Advanced Informatics: Concepts, Theory and Applications [Paper] [Benchmark]

Natural Language Inference

Mahendra, et al. (2021) IndoNLI: A Natural Language Inference Dataset for Indonesian. EMNLP [Paper] [Dataset]
Ken Nabila Setya and Rahmad Mahendra (2018) Semi-supervised Textual Entailment on Indonesian Wikipedia Data. International Conference on Computational Linguistics and Intelligent Text Processing [Paper] [Benchmark]

Sentiment Analysis

Ayu Purwarianti and Ida Ayu Putu Ari Crisdayanti (2019) Improving Bi-LSTM Performance for Indonesian Sentiment Analysis Using Paragraph Vector. International Conference of Advanced Informatics: Concepts, Theory and Applications [Paper] [IndoNLU Benchmark] [NusaX Benchmark]
Azhar, et al. (2019) Multi-label Aspect Categorization with Convolutional Neural Networks and Extreme Gradient Boosting. International Conference on Electrical Engineering and Informatics [Paper] [Benchmark]
Ilmania, et al. (2018) Aspect Detection and Sentiment Classification Using Deep Neural Network for Indonesian Aspect-Based Sentiment Analysis. International Conference on Asian Language Processing [Paper] [Benchmark]

Emotion Classification

Saputri, et al. (2018) Emotion Classification on Indonesian Twitter Dataset. International Conference on Asian Language Processing [Paper] [Dataset]

Stance Detection

Jannati, et al. (2018) Stance Classification Towards Political Figures on Blog Writing. International Conference on Asian Language Processing [Paper] [Dataset]

Hate Speech Detection

Alfina, et al. (2017) Hate Speech Detection in the Indonesian Language: A Dataset and Preliminary Study. International Conference on Advanced Computer Science and Information Systems [Paper] [Dataset]
Muhammad Okky Ibrohim and Indra Budi (2018) A Dataset and Preliminaries Study for Abusive Language Detection in Indonesian Social Media. International Conference on Computer Science and Computational Intelligence [Paper] [Dataset]
Muhammad Okky Ibrohim and Indra Budi (2019) Multi-label Hate Speech and Abusive Language Detection in Indonesian Twitter. Workshop on Abusive Language Online [Paper] [Dataset]

Clickbait Detection

Andika William and Yunita Sari (2020) CLICK-ID: A Novel Dataset for Indonesian Clickbait Headlines. Data in Brief [Paper] [Dataset]

Style Transfer

Wibowo, et al. (2020) Semi-Supervised Low-Resource Style Transfer of Indonesian Informal to Formal Language with Iterative Forward-Translation. International Conference on Asian Language Processing [Paper] [Dataset]

🧪 Collaborative Project

IndoNLP is going to start collecting new datasets at https://github.com/orgs/IndoNLP. Submissions will open in mid-June 2022. Stay tuned!
