/nlp-resources

Natural language processing resources for multiple languages, with an eye towards use for digital humanities.

GNU General Public License v3.0 (GPL-3.0)

Multilingual NLP

This list of free and open-source NLP resources, and pointers to language-specific directories of resources, was originally created for a presentation at UCLA on teaching multilingual digital humanities, on May 15, 2019.

This is not a directory but a moderately-opinionated, potentially one-time list of resources that might be of use to digital humanities folks working with languages other than English. That said, if you have suggestions, you can make a pull request.

* indicates resources I've tried out; ^ indicates resources I've created.

Language-agnostic tools & methods

These tools and methods are not tied to any particular language. The caveat is that words have to be separated by spaces (what counts as a "word" varies from language to language, and not all languages put spaces between words). A further caveat is that these tools may perform poorly on highly inflected languages (e.g. languages with a lot of grammatical cases, like Latin, Russian, or Finnish) without lemmatization (using the "dictionary form" of words, rather than whatever inflected form is actually present in the text), especially for smaller text corpora.
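To make the inflection caveat concrete, here is a minimal Python sketch. The Latin sentence and the tiny lemma table are hand-written for illustration only (a real project would use a language-specific lemmatizer); the point is just that splitting on spaces counts every inflected form as a separate "word".

```python
from collections import Counter

# Toy text with several inflected forms of Latin "puella" (girl) and "rosa" (rose).
text = "puella rosam amat puellae rosas amant puellam rosa delectat"

# Hypothetical, hand-written lemma table for this example only.
LEMMAS = {
    "puellae": "puella",
    "puellam": "puella",
    "rosam": "rosa",
    "rosas": "rosa",
    "amat": "amo",
    "amant": "amo",
    "delectat": "delecto",
}


def count_words(text, lemmas=None):
    """Count space-separated tokens, optionally mapping each one to its lemma."""
    tokens = text.lower().split()
    if lemmas is not None:
        tokens = [lemmas.get(token, token) for token in tokens]
    return Counter(tokens)


raw_counts = count_words(text)            # every inflected form counted separately
lemma_counts = count_words(text, LEMMAS)  # inflected forms merged under one lemma
```

Without the lemma table, "puella" shows up once among nine distinct tokens; with it, the three forms of "puella" are counted together. In a small corpus that difference can decide whether a word registers at all.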

Modern languages

If you're comfortable working with Python, the Polyglot library provides language detection for 196 languages, tokenization in 165 languages, named entity recognition in 40 languages, part-of-speech tagging in 16 languages, sentiment analysis in 136 languages, and morphological analysis in 135 languages. It can also manage text in multiple languages at once. If you're working a lot with one particular language, it's probably best to find more language-specific tools, but Polyglot is a better-than-nothing option for highly underresourced languages.

A few other general thoughts & notes:

  • Be very wary of stopword lists. Make sure someone who can read the language reviews a list before you pick it up and use it, or, worst case, run it through Google Translate yourself. Stopword lists often include all sorts of words that only count as "stopwords" in the domain they were built for, and you might inadvertently be excluding, for instance, all words about computers. The longer the stopword list, the more suspicious you should be.
  • For very underresourced languages (endangered languages, languages with very small speaker groups, especially languages with unique writing systems) you may find scholarly articles about NLP, but in most cases, whatever proof-of-concept is presented in the paper is a long way from being usable, and odds aren't great that it will get there.
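One way to act on the stopword warning above is to look at what a list would actually remove from your own corpus before you apply it. The sketch below is a rough illustration, not a substitute for review by someone who reads the language; the function name, the sample tokens, and the sample list are all made up for this example.

```python
from collections import Counter


def removed_by_stopwords(tokens, stopwords):
    """Return (word, count) pairs for corpus tokens the stopword list would remove.

    Pairs are ordered from most to least frequent, so the words with the
    biggest impact on the corpus are reviewed first.
    """
    stop = {word.lower() for word in stopwords}
    removed = Counter(token.lower() for token in tokens if token.lower() in stop)
    return removed.most_common()


# Hypothetical corpus and an overly broad stopword list that includes domain words.
tokens = "the computer is on the desk and the software is on the computer".split()
stopwords = ["the", "is", "on", "and", "computer", "software"]

report = removed_by_stopwords(tokens, stopwords)
share_removed = sum(count for _, count in report) / len(tokens)
```

Here the report shows that "computer" and "software" would be dropped along with the function words, which is exactly the kind of entry to question before using the list.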

Arabic

Arabic has to be segmented (clitic segmentation) before it can be used well with language-agnostic tools. The Stanford Word Segmenter supports Arabic; usage should be similar to the Chinese segmenter tutorials.

Armenian

I stumbled onto Armenian recently while looking at full-text PDFs in HathiTrust. The OCR for all the Armenian books I came across was Latin or Greek gibberish, though I was able to get (what looked to me, playing match-the-squiggles) reasonable OCR out of Tesseract. I had a nice exchange with HathiTrust about it, and they suggested that I report the errors I came across. In the meantime, though, plan to re-OCR the text if you're getting Armenian from HathiTrust.

  • Named-entity recognition: training data for Armenian NER using Wikipedia
  • Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Armenian

Chinese

Chinese needs to be segmented (spaces artificially inserted between words) before it can be used with language-agnostic tools. Stanford NLP Group has a Chinese segmenter. Michelle Fullwood has written a tutorial on using the segmenter.
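To show what "inserting spaces between words" means in practice, here is a toy Python sketch using greedy longest-match against a small word list. This is not how the Stanford segmenter works (it relies on a trained statistical model), and the vocabulary below is a hand-picked assumption for illustration; use a real segmenter for actual research.

```python
# Hypothetical mini-vocabulary: I, like, natural, language, natural language, processing.
VOCABULARY = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}


def segment(text, vocabulary):
    """Greedy forward longest-match segmentation.

    At each position, take the longest vocabulary entry that matches;
    fall back to a single character when nothing matches.
    """
    max_len = max((len(word) for word in vocabulary), default=1)
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Join with spaces so language-agnostic tools can split on whitespace.
segmented = " ".join(segment("我喜欢自然语言处理", VOCABULARY))
```

With this vocabulary, `segmented` is "我 喜欢 自然语言 处理". Dictionary matching like this breaks down quickly on ambiguous or out-of-vocabulary text, which is why trained segmenters exist.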

Dutch

French

French is partly supported by Stanford Core NLP, so the instructions for doing part-of-speech tagging should be almost identical to other languages that can use that software. Stanford Core NLP doesn't support French named-entity recognition, but there are other tools you can use like OpenNER.

  • Tutorial (with modifications): ^Part-of-speech tagging with Stanford NLP: this is the German tutorial, but in step 3, replace german-hgc.tagger with french.tagger in the code that you run. You can also use a Universal Dependencies-based tagger (also described in the German tutorial) by replacing german-hgc.tagger with french-ud.tagger. The standard French tagger uses tags from the French treebank.
  • Named-entity recognition: OpenNER supports French
  • Python: SpaCy offers POS tags, dependency parse and named entities for French based on a news corpus
  • CamemBERT language model: for part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER) and natural language inference (NLI).
  • Flaubert: word embeddings compatible with Hugging Face's Transformers library
  • Python: the Polyglot library supports language detection, part-of-speech tagging, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for French

German

There is a large community of DH folks doing text analysis on German under the "Digital Humanities im deutschsprachigen Raum" organization. Projects include QuaDramA (Quantitative Drama Analytics) and Rhythmicalizer (a digital tool to identify free verse prosody).

Hebrew

I've recently been working on a Hebrew NLP project, and should have more experience with these tools soon. Because Hebrew is a right-to-left language, I've noticed a few challenges, including file-renaming when the file names include both Hebrew and Latin characters. You may also have to navigate the right-to-left mark Unicode character when processing the text.
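For the right-to-left mark issue, one possible approach is to strip invisible directional formatting characters before tokenizing or comparing strings. The sketch below uses only the Python standard library; the function name and the choice of which characters to remove are my assumptions, so check that dropping them is appropriate for your texts.

```python
# Unicode directional formatting characters: left-to-right mark, right-to-left
# mark, Arabic letter mark, embeddings/overrides, and isolates.
BIDI_CONTROLS = {
    "\u200e", "\u200f", "\u061c",
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def strip_bidi_controls(text):
    """Remove invisible directional formatting characters from text."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


hebrew_word = "\u05e9\u05dc\u05d5\u05dd"  # "shalom", written with escapes
sample = "\u200f" + hebrew_word + "\u200f world"  # right-to-left marks around the word
cleaned = strip_bidi_controls(sample)
```

The marks are invisible when printed, so two strings that look identical can still fail an equality check or produce different tokens until the marks are removed.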

  • Directory: Hebrew NLP resources
  • Topic modeling: LemLDA: an LDA Package for Hebrew - you'll probably need to run the rule-based Hebrew tokenizer (below) on your text before trying it with this tool; punctuation like parentheses breaks it.
  • Python: *rule-based Hebrew tokenizer - I've had some problems with this (Mac, Python 3.7) with regard to successfully saving the output file, but I've stuck the core functions in a Jupyter notebook and added my own input/output code, and it's worked well.
  • Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Hebrew

Hindi

Indonesian

Italian

The major tool available for Italian is *Tint, which is based on (and depends on) Stanford NLP, but not all of the features work well. If you try one output format and it doesn't work, try another. (I can vouch for the .conll format.)

Japanese

Japanese has to be segmented before it can be used with language-agnostic tools, though Japanese segmentation is built into Voyant in theory (your mileage may vary; it just crashed for me when I tried it with a small corpus).

The most commonly used tool for Japanese text processing is MeCab, which provides segmentation and part-of-speech tagging. There are options for using it with Python, with Python on Mac and with R, but it depends on a library in C++ that may be a problem to get running. (I failed to get any version of MeCab working on a Mac, but I've seen others using it successfully on Windows.) A number of the people I've worked with haven't been very happy with the quality of its segmentation, and have preferred RakutenMA, which is what I've used.

Korean

  • Python: KoNLPy: Korean NLP in Python, includes part-of-speech tagging, corpora, dictionaries
  • R: KoNLP, part-of-speech tagging
  • Directory: Awesome-Korean-NLP, a curated directory of resources that hasn't been updated in about two years
  • Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Korean

Mongolian

  • Directory: Mongolian NLP - includes named-entity recognition, data sets (e.g. with personal and clan names)
  • Python: the Polyglot library supports language detection, morphological analysis, and sentiment analysis for Mongolian

Portuguese

Portuguese is comparatively underresourced for text analysis relative to other colonial languages. While there are materials for training named-entity recognition for Portuguese, you need larger-than-laptop compute to train it. I mean to get back to it as an excuse to learn how to use our local high-performance computing cluster.

Russian

*MyStem from Yandex (Russia's equivalent to Google) is the best NLP toolkit for Russian, and can be downloaded as standalone code. There's a wrapper for Python with PyMyStem3.

Because Russian is highly inflected (i.e. a word can appear in many forms depending on how it's used in a sentence), and each word form is treated as a separate "word" for language-agnostic tools and methods, you may get better results by lemmatizing Russian text before using it with these tools. MyStem can do this, and Python code for doing it is included in the Russian text cleaning & word vectors Jupyter notebook.

Spanish

Tagalog

Thai

Tibetan

Vietnamese

Welsh

  • Python: CyTag - text segmenter, sentence splitter, tokeniser, part-of-speech tagger
  • There are a few papers (e.g. Towards a Welsh Semantic Annotation System) talking about work on CySemTagger (a Welsh semantic annotation tool), but there doesn't seem to be a usable version yet

Yiddish

The Yiddish Book Center has thousands of scanned PDFs of books in Yiddish, but without OCR. To get plain text versions of the books (OCR'd using Jochre), you can create an account here.

  • Python: the Polyglot library supports language detection, morphological analysis, transliteration, and sentiment analysis for Yiddish

Historical languages

NLP tools for historical languages make the most sense when the language is attested in many (thousands+) documents, and the documents haven't already received a lot of scholarly attention for manual markup and analysis. Akkadian cuneiform tablets continue to be unearthed, and many of those found in archaeological digs over the last century have not yet been published in any usable form. In contrast, there have been far fewer new discoveries of texts in Old Church Slavonic, and the known manuscripts have already been thoroughly marked up by experts. As such, Akkadian is a better target for developing NLP than Old Church Slavonic.

Classical Languages Toolkit (multilingual)

CLTK provides NLP for the languages of Ancient, Classical, and Medieval Eurasia. While Greek, Latin, Akkadian, and the Germanic languages are the most complete, there is also some support for Arabic, Chinese, Ancient Egyptian, Ottoman Turkish, and various classical languages of India. Read the documentation for more information about the extent and nature of the library's coverage.

Latin

  • Linked data: LiLa - Linking Latin is developing a linked data knowledge base for Latin
  • Texts: Digital Latin Library publishes critical editions of Latin texts, and facilitates finding texts online that are written in Latin
  • Texts: Perseus Digital Library, a longstanding digital humanities project, with texts in Greek, Arabic, and English along with Latin

Coptic

Other language families & groups

Languages of Africa

Kathleen Siminyu has been working on developing NLP resources for languages of Africa, and posting updates on LinkedIn. A February 2019 post describes work on a Luganda-Kinyarwanda translation model based on word vector embeddings. There's also a collaborative project underway to develop translation models for African languages.

Indigenous languages of the Americas

"Challenges of language technologies for the indigenous languages of the Americas" (Manuel Mager, Ximena Gutierrez-Vasques, Gerardo Sierra, & Ivan Meza, in Proceedings of the 27th International Conference on Computational Linguistics, 2018) has an excellent overview of the current state of NLP for a variety of indigenous languages of the Americas.

The authors also maintain an updated directory of NLP resources for indigenous languages of the Americas.