indonesian-NLP-resources

NLP data for Bahasa Indonesia (last updated 20 Sep 2020)

Sentences Dataset

leipzig indonesian sentence collection: news articles, web articles, and Wikipedia data from 2008-2016
wn-msa.sourceforge.net Wordnet Bahasa
Quran indonesian quran translation (id.muntakhab, id.jalalayn, id.indonesian)
Kompas online collection. This corpus contains Kompas online news articles from 2001-2002. See here for more info and citations.
Tempo online collection. This corpus contains Tempo online news articles from 2000-2002. See here for more info and citations.
corpus-frog-storytelling spoken text story telling
TED-Multilingual-Parallel-Corpus Monolingual_data/Indonesian
Opus Opus NLPL
Sealang Sealang dataset

Words dataset (PUEBI word type)

word class => word noun(18647), word verb(39070) = 57717 words
word type => root word(41409), derivative word(24913), compound words, figures of speech, proverbs, expressions = 66322 words
Word root => source#1.1 : sastrawi 29932 words ; source#1.2 : sastrawi 30342 words ; source#2 : SentiStrengthID 27979 words ; source#3 : serangkai 30342 words
Word spaCy : id
word : serangkai
Word name : random-name
Word Indo name : genderprediction
Word Wiktionary : word id
word compound =>
- source#1 : 71 words
- source#2 : puebi
Word Acronyms =>
- source#1 : 4085 words ;
- source#2 : 70 words
Word Negative =>
- source#1.1 : 3829 words ; source#1.2 : 3523 words ; source#1.3 : 154 words ;
- source#2 : ID-OpinionWords 2402 words
- source#3 : 3523 words
- source#4 : 126 words
Word Positive =>
- source#1.1 : 1678 words ; source#1.2 : 40 words ; source#1.3 : 1293 words ;
- source#2 : 1182 words
- source#3 : 1293 words
Word Slang =>
- source#1 : 1319 words ;
- source#2 : 286 words ;
- source#3 : 1147 words
- source#4 : 62 words
- source#5 : 15167 words
Stopwords =>
- source#1 : spacy data ;
- source#2 : 759 words ;
- source#3 : 399 words ;
- source#4 : 759+329+124+126 words
Emoticon =>
- source#1 : 252 ;
- source#2 : 3018 ;
- source#3 : 123
Named Entity =>
- source#1 : [Place] country
- source#1 : [Place] Wilayah-Administratif-Indonesia (provinces, villages, districts, regencies)
- source#2 : [Place] Indonesia-Postal-Code (provinces, cities, subdistricts, urbans)
- source#3 : [Place] indonesian-region
- source#3 : [Person] gender prediction
- source#4 : [Person] random name
- source#5 : [Person] title of name
- source#6 : [Person] degree
- source#7 : [Org] institution
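
Several of the word lists above (stopwords in particular) overlap between sources, so a combined list is usually smaller than the sum of its parts. The following is a minimal sketch of merging lists and filtering tokens, assuming each source has already been read into a Python list of strings. The helper names and the sample words are illustrative only and are not taken from any of the listed sources.

```python
def merge_word_lists(*word_lists):
    """Merge several word lists into one sorted, deduplicated list.

    Words are lower-cased and stripped of surrounding whitespace so that
    "Dan" and "dan " from different sources count as the same entry.
    """
    merged = set()
    for words in word_lists:
        for word in words:
            cleaned = word.strip().lower()
            if cleaned:  # skip empty lines
                merged.add(cleaned)
    return sorted(merged)


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword list (case-insensitive)."""
    stopword_set = set(stopwords)
    return [token for token in tokens if token.lower() not in stopword_set]


# Hypothetical samples standing in for two stopword sources.
source_a = ["yang", "dan", "di", "ke"]
source_b = ["Dan", "dari", "yang ", ""]

stopwords = merge_word_lists(source_a, source_b)
tokens = "Saya pergi ke pasar dan toko".split()
filtered = remove_stopwords(tokens, stopwords)

print(stopwords)
print(filtered)
```

Normalizing before deduplicating is a design choice: it keeps the merged list small, at the cost of losing any case distinctions a source may have intended.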

Tagged dataset

NER =>
- source#1 : yohanesgultom/nlp-experiments 1700 sentences
- source#2 : yusufsyaifudin/indonesia-ner 1835 sentences
POS-TAG
- POS-TAG : famrashel/idn-tagged-corpus
- POS-TAG : pebbie/pebahasa ~600 sentences
- POS-TAG Parser : UniversalDependencies/UD_Indonesian-GSD ~4477 sentences
Sentiment =>
- source#1 : 1506 sentences ;
- source#2 : Sentiment words with strength agusmakmun/SentiStrengthID 1573 (range: -5 to 5) ;
- source#3 : Sentiment words with weights fajri91/InSet -> separate word lists with strength weights (range: -5 to 5). 6610 negative words and 3619 positive words
panl10n Pan Localization
Acronyms : ramaprakoso/analisis-sentimen 4085 words
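
The weighted sentiment lexicons listed above give each word a strength between -5 and 5. One simple way to use such a list is to sum the weights of the words found in a sentence. The sketch below assumes the lexicon has been loaded into a dict mapping word to integer weight; the function names, the toy entries, and their weights are invented for illustration and are not real entries from SentiStrengthID or InSet.

```python
def score_sentence(sentence, lexicon):
    """Sum the lexicon weights of all whitespace-separated tokens.

    Tokens missing from the lexicon contribute 0.
    """
    tokens = sentence.lower().split()
    return sum(lexicon.get(token, 0) for token in tokens)


def label_from_score(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical lexicon: weights are made up, within the -5..5 range.
toy_lexicon = {"bagus": 3, "senang": 4, "buruk": -3, "kecewa": -4}

score = score_sentence("Filmnya bagus tapi saya kecewa", toy_lexicon)
print(score, label_from_score(score))
```

This ignores negation, punctuation, and morphology; a real pipeline would likely tokenize and stem first, for example with one of the root word resources listed earlier.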

Parallel corpus Eng-Ind

Sentence Analyzer

MALINDO_Morph
morphind
INDRA
pujangga : An interface for InaNLP and Deeplearning4j's Word2Vec for Indonesian (Bahasa Indonesia) in the form of REST API.
id-multi-label-hate-speech-and-abusive-language-detection : A dataset for multi-label hate speech and abusive language detection on Indonesian Twitter.
kawat : A Word Analogy Task Dataset for Indonesian

Crawler Data

Crawler Indonesian news portal

meizee/indonesian-NLP-resources


Word reference (kemdikbud) link
