This list of free and open-source NLP resources, and pointers to language-specific directories of resources, was originally created for a presentation at UCLA on teaching multilingual digital humanities, on May 15, 2019.
This is not a directory but a moderately-opinionated, potentially one-time list of resources that might be of use to digital humanities folks working with languages other than English. That said, if you have suggestions, you can make a pull request.
* indicates resources I've tried out, ^ indicates resources I've created.
These tools and methods are not tied to any particular language. The caveat is that words have to be separated by a space (and what counts as a "word" varies from language to language, and not all languages put spaces between words). A further caveat is that highly inflected languages (e.g. languages with a lot of grammatical cases, like Latin, Russian, or Finnish) may perform poorly without lemmatization (using the "dictionary form" of words, rather than whatever inflected form is actually present in the text), especially for smaller text corpora.
- Voyant
- Lexos
- Topic modeling - like Mallet; if you use the Topic Modeling Tool for a GUI-based interface to Mallet, be sure to go into the "optional settings" and clear the text in the "Tokenize with regular expression" field. Also, your text files must be saved as UTF-8, or it won't work.
- Word vectors - Ryan Heuser has a nice set of blog posts introducing word vectors for literary analysis, and you can adapt this Jupyter notebook for Russian text cleaning & word vectors to most languages (more generalized word vector notebook / tutorial coming this summer!)
- Word counting, keyword-in-context, and similar approaches (many of which are included in Voyant and Lexos, but you could also write Python or R code)
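If you want to script the word counting and keyword-in-context yourself, here's a minimal Python sketch using only the standard library (the file name and keyword are placeholders; substitute your own):

```python
from collections import Counter
from pathlib import Path

# Placeholder file name and keyword -- substitute your own corpus and search term.
text = Path("my_corpus.txt").read_text(encoding="utf-8")

# Naive whitespace tokenization: this is exactly the "words separated by spaces"
# caveat above, and every inflected form counts as a separate word.
tokens = text.split()

# Word counts
print(Counter(tokens).most_common(20))

# Keyword-in-context: print a few words on either side of each match.
keyword = "liberty"
window = 5
for i, token in enumerate(tokens):
    if token == keyword:
        print(" ".join(tokens[max(0, i - window):i + window + 1]))
```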
If you're comfortable working with Python, the Polyglot library provides language detection for 196 languages, tokenization in 165 languages, named entity recognition in 40 languages, part-of-speech tagging in 16 languages, sentiment analysis in 136 languages, and morphological analysis in 135 languages. It can also manage text in multiple languages at once. If you're working a lot with one particular language, it's probably best to find more language-specific tools, but for highly underresourced languages, Polyglot is better than nothing.
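A minimal sketch of what Polyglot usage looks like, assuming you've installed the library and downloaded the models for your language with the `polyglot download` command (the sample sentence here is a placeholder; see the Polyglot documentation for which downloads each task needs):

```python
from polyglot.detect import Detector
from polyglot.text import Text

sample = "Das ist ein kurzer Beispielsatz über Berlin."  # placeholder sentence

# Language detection
detector = Detector(sample)
print(detector.language.code, detector.language.confidence)

# Tokenization, POS tagging, NER, and word-level sentiment via the Text object;
# each task needs its model downloaded first (e.g. `polyglot download pos2.de ner2.de`).
text = Text(sample)
print(text.words)
print(text.pos_tags)                          # only for languages with POS support
print(text.entities)                          # only for languages with NER support
print([(w, w.polarity) for w in text.words])  # word-level sentiment (-1, 0, +1)
```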
A few other general thoughts & notes:
- Be very wary of stopword lists. Make sure you have someone who can read the language review it before you pick it up and use it, or, worst case, run it through Google Translate yourself. Stopword lists often include all sorts of words that only count as "stopwords" in the domain they were built for, and you might inadvertently be excluding, for instance, all words about computers. The longer the stopword list, the more suspicious you should be (see the sketch after this list for one quick sanity check).
- For very underresourced languages (endangered languages, languages with very small speaker groups, especially languages with unique writing systems) you may find scholarly articles about NLP, but in most cases, whatever proof-of-concept is presented in the paper is a long way from being usable, and odds aren't great that it will get there.
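One quick way to vet a stopword list before trusting it is to look at what it would actually remove from your corpus. A minimal sketch (the file names are placeholders):

```python
from collections import Counter
from pathlib import Path

# Placeholder file names -- substitute your own corpus and candidate stopword list.
tokens = Path("my_corpus.txt").read_text(encoding="utf-8").split()
stopwords = set(Path("stopwords.txt").read_text(encoding="utf-8").split())

removed = [t for t in tokens if t in stopwords]
print(f"{len(removed) / len(tokens):.1%} of corpus tokens would be removed")

# The most frequent removed "stopwords" are the ones worth showing to someone
# who reads the language: domain words hiding in the list usually surface here.
print(Counter(removed).most_common(30))
```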
Arabic has to be segmented (clitic segmentation) before it can be used well with language-agnostic tools. The Stanford Word Segmenter supports Arabic; usage should be similar to the Chinese segmenter tutorials.
- NLP resource directory: Github has a tag for Arabic NLP including pointers to repos for sentiment analysis for tweets, reviews, and standard Arabic, named entity recognition, etc.
- Part-of-speech tagger: Stanford NLP has a part-of-speech tagger for Arabic
- OCR: Tesseract 4.0 has training data for Arabic
- Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis (standard and Egyptian Arabic), transliteration, and sentiment analysis (standard and Egyptian Arabic)
I stumbled onto Armenian recently while looking at full-text PDFs in HathiTrust. The OCR for all the Armenian books I came across was Latin or Greek gibberish, though I was able to get (what looked to me, playing match-the-squiggles) reasonable OCR out of Tesseract. I had a nice exchange with HathiTrust about it, which ended with the suggestion that I report the errors I came across. In the meantime, though, plan to re-OCR the text if you're getting Armenian from HathiTrust.
- OCR: Tesseract 4.0 has *training data for Armenian
- Named-entity recognition: training data for Armenian NER using Wikipedia
- Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Armenian
Chinese needs to be segmented (spaces artificially inserted between words) before it can be used with language-agnostic tools. Stanford NLP Group has a Chinese segmenter. Michelle Fullwood has written a tutorial on using the segmenter.
- Tutorial: ^Chinese part-of-speech tagging with Stanford NLP tools
- Tutorial: ^Chinese named-entity recognition with Stanford NLP tools
- Python: *xpinyin (for Mandarin) and python-jyutping (for Cantonese), for transliterating Chinese into a phonetic representation, with or without tones; see the sketch after this list. (Example: ^Taiwanese rap analyzer Jupyter notebook for identifying lines of Taiwanese rap lyrics that include repeated tones.)
- Python: *PyCantonese: includes jyutping converter/search, stopwords, and part-of-speech tagging for Cantonese.
- OCR: Tesseract 4.0 has training data for simplified Chinese Characters, vertical simplified Chinese characters, traditional Chinese characters, and vertical traditional Chinese characters
- Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis (Chinese and Gan Chinese), transliteration, and sentiment analysis (Chinese and Gan Chinese)
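As an illustration of the transliteration step mentioned in the xpinyin entry above, here's a minimal sketch (the sample text is a placeholder, and the `tone_marks` parameter reflects the current version of the library; check its README if it errors):

```python
from xpinyin import Pinyin

p = Pinyin()
sample = "中文分词很重要"  # placeholder text

print(p.get_pinyin(sample))                        # hyphen-separated pinyin, no tones
print(p.get_pinyin(sample, tone_marks="marks"))    # with tone diacritics
print(p.get_pinyin(sample, tone_marks="numbers"))  # with tone numbers
```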
French is partly supported by Stanford CoreNLP, so the instructions for doing part-of-speech tagging should be almost identical to other languages that can use that software. Stanford CoreNLP doesn't support French named-entity recognition, but there are other tools you can use, like OpeNER.
- Tutorial (with modifications): ^Part-of-speech tagging with Stanford NLP: this is the German tutorial, but in step 3, replace german-hgc.tagger with french.tagger in the code that you run. You can also use a Universal Dependencies-based tagger (also described in the German tutorial) by replacing german-hgc.tagger with french-ud.tagger. The standard French tagger uses tags from the French treebank.
- Named-entity recognition: OpeNER supports French
- Python: SpaCy offers POS tags, dependency parsing, and named entities for French based on a news corpus (see the sketch after this list)
- OCR: Tesseract 4.0 has training data for French
- Python: the Polyglot library supports language detection, part-of-speech tagging, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for French
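The spaCy route mentioned above looks roughly like this, assuming you've downloaded the French news model (`fr_core_news_sm`; the sample sentence is a placeholder):

```python
import spacy

# Requires the model download first: python -m spacy download fr_core_news_sm
nlp = spacy.load("fr_core_news_sm")

doc = nlp("Victor Hugo a publié Les Misérables à Bruxelles en 1862.")  # placeholder sentence

# Part-of-speech tags and lemmas
for token in doc:
    print(token.text, token.pos_, token.lemma_)

# Named entities
for ent in doc.ents:
    print(ent.text, ent.label_)
```

The same pattern works for the German, Italian, Portuguese, and Spanish spaCy models mentioned in the sections below (de_core_news_sm, it_core_news_sm, pt_core_news_sm, es_core_news_sm).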
There is a large community of DH folks doing text analysis on German under the "Digital Humanities im deutschsprachigen Raum" organization. Projects include QuaDramA (Quantitative Drama Analytics) and Rhythmicalizer (a digital tool to identify free verse prosody).
- Tutorial: ^German part-of-speech tagging with Stanford NLP
- Tutorial: ^German named-entity recognition with Stanford NLP
- Book & code: Andrew Piper's Enumerations: Data and Literary Study (Chicago 2018) includes numerous German examples. The data and code from the book are available on Github
- Jupyter notebooks: Example code for doing German NLP with different packages
- Directory: NLP resources and tools for German
- Python: SpaCy offers POS tags, dependency parse and named entities for German based on a news corpus
- OCR: Tesseract 4.0 has training data for German as well as Fraktur
- Python: the Polyglot library supports language detection, part-of-speech tagging, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for German
- Blog post: Common pitfalls with the pre-processing of German text for NLP - geared towards commercial applications, but provides a useful overview and comparison of different part-of-speech taggers, stopword lists, compound splitting, etc.
I've recently been working on a Hebrew NLP project, and should have more experience with these tools soon. Because Hebrew is a right-to-left language, I've noticed a few challenges, including file-renaming when the file names include both Hebrew and Latin characters. You may also have to navigate the right-to-left mark Unicode character when processing the text.
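If directionality marks are sneaking into your text, here's a minimal Python sketch for stripping the right-to-left mark (and related directionality characters) before further processing; the file names are placeholders:

```python
from pathlib import Path

# Unicode directionality characters that often hide in text from right-to-left
# sources: LRM, RLM, and the embedding/override/pop-directional controls.
DIRECTIONAL_MARKS = "\u200e\u200f\u202a\u202b\u202c\u202d\u202e"

raw = Path("hebrew_text.txt").read_text(encoding="utf-8")   # placeholder file name
cleaned = raw.translate({ord(c): None for c in DIRECTIONAL_MARKS})
Path("hebrew_text_clean.txt").write_text(cleaned, encoding="utf-8")
```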
- Directory: Hebrew NLP resources
- Topic modeling: LemLDA: an LDA Package for Hebrew - you'll probably need to run the rule-based Hebrew tokenizer (below) on your text before trying it with this tool; punctuation like parentheses breaks it.
- Python: *rule-based Hebrew tokenizer - I've had some problems with this (Mac, Python 3.7) with regard to successfully saving the output file, but I've stuck the core functions in a Jupyter notebook and added my own input/output code, and it's worked well.
- OCR: Tesseract 4.0 has training data for Hebrew
- Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Hebrew
- Jupyter notebooks: Tokenizer and classifier (positive/neutral/negative) for Hindi based on Wikipedia, movie reviews, BBC news
- Jupyter notebook: Training Hindi word embeddings using Wikipedia data
- Python: Tokenizer and stemmer for Hindi
- Python: Hindi dependency parser
- Tutorial/Python: Hindi part-of-speech tagging using NLTK
- OCR: Tesseract 4.0 has training data for Hindi
- Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis (Hindi and Fiji Hindi), transliteration, and sentiment analysis for Hindi
- Directory: Indonesian NLP resources
- Directory: Bahasa Indonesia Natural Language Processing
- OCR: Tesseract 4.0 has training data for Indonesian
- Python: the Polyglot library supports language detection, part-of-speech tagging, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Indonesian
The major tool available for Italian is *Tint, which is based on (and depends on) Stanford NLP, but not all of the features work well. If you try one output format and it doesn't work, try another. (I can vouch for the .conll format.)
- Tutorial: ^Italian part-of-speech tagging with Tint
- Tutorial: ^Italian named-entity recognition with Tint
- Python: SpaCy offers POS tags, dependency parse and named entities for Italian based on a news corpus
- Python: the Polyglot library supports language detection, part-of-speech tagging, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Italian
Japanese has to be segmented before it can be used with language-agnostic tools, though Japanese segmentation is built into Voyant in theory (your mileage may vary; it crashed for me when I tried it with a small corpus).
The most commonly used tool for Japanese text processing is MeCab, which provides segmentation and part-of-speech tagging. There are options for using it with Python, with Python on Mac, and with R, but it depends on a C++ library that can be a problem to get running. (I failed to get any version of MeCab working on a Mac, but I've seen others using it successfully on Windows.) A number of the people I've worked with haven't been very happy with the quality of its segmentation, and have preferred RakutenMA, which is what I've used.
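For what it's worth, when MeCab does install cleanly, the Python usage is minimal. A sketch assuming the mecab-python3 package and a dictionary (such as unidic-lite) are installed; the sample sentence is a placeholder:

```python
import MeCab

sample = "自然言語処理はとても楽しいです。"  # placeholder sentence

# -Owakati output mode returns the sentence with spaces inserted between words,
# which is what the language-agnostic tools above expect.
wakati = MeCab.Tagger("-Owakati")
print(wakati.parse(sample))

# The default output mode adds part-of-speech and reading information per token.
tagger = MeCab.Tagger()
print(tagger.parse(sample))
```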
- Directory: Japanese text analysis by Molly Des Jardin
- Jupyter notebook: Japanese segmentation with RakutenMA - doesn't work yet on Windows due to Unicode issues
- Tutorial: Japanese part-of-speech tagging with RakutenMA - uses the demo web interface for RakutenMA
- Tutorial: Japanese named-entity recognition with Apache OpenNLP
- OCR: Tesseract 4.0 has training data for Japanese and vertical Japanese, but Japanese OCR quality isn't great overall. In Molly Des Jardin's comparison, Adobe Acrobat Professional performed best for her PDFs, but all the tools had problems, especially with half-width characters and furigana.
- Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Japanese
- Python: KoNLPy: Korean NLP in Python, includes part-of-speech tagging, corpora, and dictionaries (see the sketch after this list)
- R: KoNLP, part-of-speech tagging
- Directory: Awesome-Korean-NLP, a curated directory of resources, hasn't been updated in about two years
- OCR: Tesseract 4.0 has training data for Korean and vertical Korean
- Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Korean
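A minimal sketch of the KoNLPy usage mentioned above, using its Okt tagger class (KoNLPy wraps Java-based analyzers, so it needs a JDK installed; the sample sentence is a placeholder):

```python
from konlpy.tag import Okt

okt = Okt()
sample = "한국어 자연어 처리는 재미있습니다."  # placeholder sentence

print(okt.morphs(sample))  # morpheme segmentation
print(okt.nouns(sample))   # nouns only
print(okt.pos(sample))     # (morpheme, part-of-speech tag) pairs
```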
- Directory: Mongolian NLP - includes named-entity recognition, data sets (e.g. with personal and clan names)
- OCR: Tesseract 4.0 has training data for Mongolian
- Python: the Polyglot library supports language detection, morphological analysis, and sentiment analysis for Mongolian
Portuguese is comparatively underresourced for text analysis, relative to other colonial languages. While there are materials for training named-entity recognition for Portuguese, you need more than a laptop's worth of compute to train a model. I mean to get back to it as an excuse to learn how to use our local high-performance computing cluster.
- Tutorial: ^Brazilian Portuguese part-of-speech tagger
- Incomplete tutorial: ^Portuguese named-entity recognition, based on this tutorial by André Pires and using materials from his master's thesis
- Python: SpaCy offers POS tags, dependency parse and named entities for Portuguese based on a news corpus
- Tutorial: Portuguese examples for Natural Language Processing with Python (NLTK)
- OCR: Tesseract 4.0 has training data for Portuguese
- Python: the Polyglot library supports language detection, part-of-speech tagging, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Portuguese
*MyStem from Yandex (Russia's equivalent to Google) is the best NLP toolkit for Russian, and can be downloaded as standalone code. There's a wrapper for Python with PyMyStem3.
Because Russian is highly inflected (i.e. a word can appear in many forms depending on how it's used in a sentence), and each word form is treated as a separate "word" for language-agnostic tools and methods, you may get better results by lemmatizing Russian text before using it with these tools. MyStem can do this, and Python code for doing it is included in the Russian text cleaning & word vectors Jupyter notebook.
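A minimal lemmatization sketch using the pymystem3 wrapper (it downloads the MyStem binary on first use; the sample sentence is a placeholder):

```python
from pymystem3 import Mystem

m = Mystem()
sample = "Мы гуляли по красивым городам России."  # placeholder sentence

# lemmatize() returns a list of lemmas plus the whitespace and punctuation
# between them, so joining the list reconstructs the text in dictionary forms.
lemmas = m.lemmatize(sample)
print("".join(lemmas))
```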
- Tutorial: ^Russian part-of-speech tagger
- Tutorial: ^Russian named-entity recognition, uses Natasha Python module
- Jupyter notebook: ^Russian text cleaning & word vectors
- Python: *Natasha module for named-entity recognition
- OCR: Tesseract 4.0 has training data for Russian
- Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Russian
- Tutorial: ^Spanish part-of-speech tagging with Stanford NLP
- Tutorial: ^Spanish named-entity recognition with Stanford NLP
- Python: SpaCy offers POS tags, dependency parse and named entities for Spanish based on a news corpus
- Directory: Spanish NLP
- OCR: Tesseract 4.0 has training data for Spanish and Old Spanish
- Python: the Polyglot library supports language detection, part-of-speech tagging, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Spanish
- Part-of-speech tagger: Filipino tagger for use with Stanford NLP tagger
- Sentiment analysis: Sentiment analysis for Filipino tweets
- OCR: Tesseract 4.0 has training data for Tagalog
- Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Tagalog
- Directory: Thai NLP
- Directory: Github has a Thai NLP tag
- Python: PyThaiNLP - transliteration, tokenization, part-of-speech tagging, collation, and other features (see the sketch after this list)
- R: thainltk: Thai National Language Toolkit
- OCR: Tesseract 4.0 has training data for Thai
- Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Thai
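A minimal PyThaiNLP tokenization sketch; like Chinese and Japanese, Thai needs word boundaries inserted before the language-agnostic tools are usable (the sample text is a placeholder):

```python
from pythainlp.tokenize import word_tokenize

sample = "การประมวลผลภาษาธรรมชาติสนุกมาก"  # placeholder text

# Returns a list of Thai words; join with spaces for tools that expect
# whitespace-separated tokens.
print(" ".join(word_tokenize(sample)))
```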
- Python: Pybo Tibetan tokenizer
- Transliteration: Wylie converter, a wrapper for an existing Perl tool
- Transliteration: Tibetan Phonetics Engine, transliteration based on different schemes / dialects
- Part-of-speech tagger: Universal Dependencies Part-of-Speech Tagger for Tibetan
- Python: the Polyglot library supports language detection, morphological analysis, and sentiment analysis for Tibetan
- VnCoreNLP: A Vietnamese natural language processing toolkit (Java) - provides word segmentation, POS tagging, named entity recognition (NER) and dependency parsing
- Directory: Github has a Vietnamese NLP tag
- Sentiment analysis: Vietnamese sentiment analysis for tweets
- OCR: Tesseract 4.0 has training data for Vietnamese
- Python: the Polyglot library supports language detection, named entity extraction (using Wikipedia data), morphological analysis, transliteration, and sentiment analysis for Vietnamese
The Yiddish Book Center has thousands of scanned PDFs of books in Yiddish, but without OCR. To get plain-text versions of the books (OCR'd using Jochre), you can create an account on their site.
- OCR: Jochre does Yiddish OCR using supervised machine learning techniques
- Python: the Polyglot library supports language detection, morphological analysis, transliteration, and sentiment analysis for Yiddish
NLP tools for historical languages make the most sense when the language is attested in many (thousands+) documents, and the documents haven't already received a lot of scholarly attention for manual markup and analysis. Akkadian cuneiform tablets continue to be unearthed, and many of those found in archaeological digs over the last century have not yet been published in any usable form. In contrast, there have been far fewer new discoveries of texts in Old Church Slavonic, and the known manuscripts have already been thoroughly marked up by experts. As such, Akkadian is a better target for developing NLP than Old Church Slavonic.
CLTK provides NLP for the languages of Ancient, Classical, and Medieval Eurasia. While Greek, Latin, Akkadian, and the Germanic languages are the most complete, there is also some support for Arabic, Chinese, Ancient Egyptian, Ottoman Turkish, and various classical languages of India. Read the documentation for more information about the extent and nature of the library's coverage.
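As an example of the kind of thing CLTK does, here is a minimal Latin lemmatization sketch. The import paths follow the CLTK documentation at the time of writing; the library's API changes between versions, so treat this as a sketch and check the current docs before relying on it:

```python
from cltk.corpus.utils.importer import CorpusImporter
from cltk.stem.lemma import LemmaReplacer

# One-time download of the Latin models used by the lemmatizer
corpus_importer = CorpusImporter("latin")
corpus_importer.import_corpus("latin_models_cltk")

# Replace inflected forms with dictionary forms
lemmatizer = LemmaReplacer("latin")
print(lemmatizer.lemmatize("Gallia est omnis divisa in partes tres".lower()))
```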
- Jupyter notebooks: CLTK offers Jupyter notebook tutorials for how to use its functionality
- Linked data: LiLa - Linking Latin is developing a linked data knowledge base for Latin
- Texts: Digital Latin Library publishes critical editions of Latin texts, and facilitates finding texts online that are written in Latin
- Texts: Perseus Digital Library, a longstanding digital humanities project, with texts in Greek, Arabic, and English along with Latin
- OCR: Tesseract 4.0 has training data for Latin
Kathleen Siminyu has been working on developing NLP resources for languages of Africa, and posting updates on LinkedIn. A February 2019 post describes work on a Luganda-Kinyarwanda translation model based on word vector embeddings.
"Challenges of language technologies for the indigenous languages of the Americas" (Manuel Mager, Ximena Gutierrez-Vasques, Gerardo Sierra, & Ivan Meza) in Proceedings of the 27th International Conference on Computational Linguistics, 2018) has an excellent overview of the current state of NLP for a variety of indigenous languages of the Americas.
The authors also maintain an updated directory of NLP resources for indigenous languages of the Americas.
- Cherokee OCR: Tesseract 4.0 has training data for Cherokee
- Inuktitut OCR: Tesseract 4.0 has training data for Inuktitut