
Data resources for Indonesian NLP

MIT License


NLP data for Bahasa Indonesia (last updated 20 Sep 2020)

Sentences Dataset

  1. leipzig indonesian sentence collection: news articles, web articles, and Wikipedia data from 2008-2016
  2. wn-msa.sourceforge.net Wordnet Bahasa
  3. Quran indonesian quran translation (id.muntakhab, id.jalalayn, id.indonesian)
  4. Kompas online collection. This corpus contains Kompas online news articles from 2001-2002. See here for more info and citations.
  5. Tempo online collection. This corpus contains Tempo online news articles from 2000-2002. See here for more info and citations.
  6. corpus-frog-storytelling spoken text story telling
  7. TED-Multilingual-Parallel-Corpus Monolingual_data/Indonesian
  8. Opus Opus NLPL
  9. Sealang Sealang dataset

Word reference (kemdikbud) link

  1. Base entries (Entri Dasar): 50,668 (45.02%)
  2. Derived words (Kata Turunan): 26,835 (23.85%)
  3. Compound words (Gabungan Kata): 31,492 (27.98%)
  4. Proverbs (Peribahasa): 2,054 (1.83%)
  5. Figurative expressions (Kiasan): 269 (0.24%)
  6. Expressions (Ungkapan): 1,131 (1.00%)
  7. Variants (Varian): 89 (0.08%)
  8. Total entries: 112,538 (100.00%)
  9. Total senses: 131,533
  10. Total examples: 30,010
  11. Total categories: 234
  12. Senses per entry: 1.169
  13. Examples per sense: 0.228
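
The totals and ratios in the list above can be re-derived from the per-category counts. The following is a minimal sketch for checking them; the dictionary keys are English labels chosen here for illustration and are not names used by the source.

```python
# Entry counts per category, copied from the list above.
# The key names are illustrative labels, not identifiers from the source.
counts = {
    "base_entries": 50668,
    "derived_words": 26835,
    "compound_words": 31492,
    "proverbs": 2054,
    "figurative": 269,
    "expressions": 1131,
    "variants": 89,
}
total_senses = 131533
total_examples = 30010

# Total entries is the sum of all categories.
total_entries = sum(counts.values())

# Share of each category as a percentage of all entries, rounded to 2 places.
shares = {name: round(100 * n / total_entries, 2) for name, n in counts.items()}

# Average number of senses per entry and examples per sense.
senses_per_entry = round(total_senses / total_entries, 3)
examples_per_sense = round(total_examples / total_senses, 3)

print(total_entries, senses_per_entry, examples_per_sense)
```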

Words dataset (PUEBI word types)

  1. Word class => nouns (18647), verbs (39070) = 57717 words
  2. Word type => root words (41409), derivative words (24913), compound words, figures of speech, proverbs, expressions = 66322 words
  3. Word root => source#1.1 : sastrawi 29932 words ; source#1.2 : sastrawi 30342 words ; source#2 : SentiStrengthID 27979 words ; source#3 : serangkai 30342 words
  4. Word spaCy : id
  5. word : serangkai
  6. Word name : random-name
  7. Word Indo name : genderprediction
  8. Word Wiktionary : word id
  9. word compound =>
  10. Word Acronyms =>
  11. Word Negative =>
  12. Word Positive =>
  13. Word Slang =>
  14. Stopwords =>
  15. Emoticon =>
  16. Name Entity =>
    • source#1 : [Place] country
    • source#1 : [Place] Wilayah-Administratif-Indonesia (provinces, villages, districts, regencies)
    • source#2 : [Place] Indonesia-Postal-Code (provinces, cities, subdistricts, urbans)
    • source#3 : [Place] indonesian-region
    • source#3 : [Person] gender prediction
    • source#4 : [Person] random name
    • source#5 : [Person] title of name
    • source#6 : [Person] degree
    • source#7 : [Org] institution

Tagged dataset

  1. NER =>
    • source#1 : yohanesgultom/nlp-experiments 1700 sentences
    • source#2 : yusufsyaifudin/indonesia-ner 1835 sentences
  2. POS-TAG
    • POS-TAG : famrashel/idn-tagged-corpus
    • POS-TAG : pebbie/pebahasa ~600 sentences
    • POS-TAG Parser : UniversalDependencies/UD_Indonesian-GSD ~4477 sentences
  3. Sentiment =>
    • source#1 : 1506 sentences ;
    • source#2 : Sentiment words with strength agusmakmun/SentiStrengthID 1573 (range: -5 to 5) ;
    • source#3 : Sentiment with weight fajri91/InSet -> separate word lists with a strength weight per word (range: -5 to 5). 6610 negative words and 3619 positive words
  4. panl10n Pan Localization
  5. Acronyms : ramaprakoso/analisis-sentimen 4085 words
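
The weighted sentiment word lists above (strength values from -5 to 5) lend themselves to simple lexicon-based scoring. The following is a minimal sketch of that idea, not the method of any listed repository: the `sentiment_score` function, the toy lexicon, and its weights are all hypothetical, and loading the real word lists is left out because their file formats are not described here.

```python
def sentiment_score(tokens, lexicon):
    """Sum the weights of tokens found in the lexicon; unknown tokens count as 0."""
    return sum(lexicon.get(token.lower(), 0) for token in tokens)


# Hypothetical entries with made-up weights in the -5..5 range.
# A real run would load the weights from one of the listed word lists.
toy_lexicon = {"bagus": 3, "senang": 4, "buruk": -3, "kecewa": -4}

# Whitespace tokenization is a simplification for this sketch.
positive = sentiment_score("Filmnya bagus dan saya senang".split(), toy_lexicon)
negative = sentiment_score("Pelayanan buruk saya kecewa".split(), toy_lexicon)

print(positive, negative)
```

A score above zero suggests positive sentiment and a score below zero negative sentiment; this sketch ignores negation and slang normalization, which the negative-word and slang lists above could help address.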

Parallel corpus (English-Indonesian)

  1. parallel-corpora-en-id
  2. Indonesian-English-Bilingual-Corpus
  3. TALPCo
  4. opus
  5. Multi-Wiki

Sentence Analyzer

  1. MALINDO_Morph
  2. morphind
  3. INDRA
  4. pujangga : An interface for InaNLP and Deeplearning4j's Word2Vec for Indonesian (Bahasa Indonesia) in the form of REST API.
  5. id-multi-label-hate-speech-and-abusive-language-detection : A dataset for multi-label hate speech and abusive language detection on Indonesian Twitter.
  6. kawat : A Word Analogy Task Dataset for Indonesian

Crawler Data

  1. Crawler for Indonesian news portals