A curated list of open-access, open-source, and off-the-shelf resources and tools developed with a particular focus on German.
This list primarily includes resources and tools that are currently maintained and can be used off-the-shelf or with minor adjustments. It is deliberately biased toward usability and user-friendliness.
Pull requests and suggestions are welcome! See contributing guidelines.
- Corpora
- Generic resources
- Linguistic processing
- Semantic analysis
- Speech NLP
- Machine Translation
- Teaching resources and tutorials
- More lists
- Deutsches Textarchiv
- Elektronische Texte (Thomas Gloning)
- German Drama Corpus
- Referenzkorpus Mittelhochdeutsch (1050-1350)
- Referenzkorpus Mittelniederdeutsch/Niederrheinisch (1200-1650)
- Referenzkorpus Frühneuhochdeutsch (1350-1650)
- arg-microtexts
- auto-hMDS (multi-document summarization)
- Dortmunder Chat Korpus
- German Political Speeches Corpus
- GermaParl (Bundestag)
- One Million Posts Corpus
- Open Speech Data Corpus
- Potsdam Commentary Corpus (PCC)
- TTLab StadtWiki Corpus
- ArchiMob Corpus
- NOAH's Corpus of Swiss German Dialects
- SpinningBytes Swiss German Sentiment Corpus
- Swiss SMS Corpus
- DeReWo
- DiMLex (lexicon of German discourse markers)
- German Compound Database
- German nouns from Wiktionary
- German Wiktionary Lexicon Graph
- German word list for GNU Aspell
- OpenThesaurus
- CLARIN-D list
- Corpora at the IMS
- CorpusExplorer's list of corpora
- Korpora am IMS
- Parallel corpora (see below)
- Treebanks (see below)
- ZAS list
- AmbiverseNLU
- CLARIN-D web tools
- CorpusExplorer
- DKPro
- flair
- Mate Tools
- spaCy
- Stanford CoreNLP
- textblob-de
- TextImager
- German Universal Dependency Treebank
- Hamburg Dependency Treebank
- NEGRA
- TIGER Corpus
- TGermaCorp (literary texts)
- TüBa-D/Z
- CharSplit
- DEMorphy
- Durm Lemmatizer
- HypheNN-de
- jwordsplitter
- MarMoT
- Morphy
- morphisto
- SECOS (unsupervised compound splitter)
- SMOR
- AmbiverseNLU KnowNER
- flair
- GermaNER
- LSTM+CRF+FastText with models for (historic) German
- microNER
- ner-corpora
- (Faruqui & Pado 2010) Components and evaluation data
- Complex Word Identification (DE, EN, ES, FR)
- Distributional thesauri (includes German)
- Lexical Chains
- schulteimwalde.de/resources.html
- Semantic Relations in Context
- UKP Darmstadt data list
- disco (semantic similarity)
- GermaNet
- german2vec
- GermanWordEmbeddings
- Open German WordNet
- sensegram
- SpinningBytes word embeddings (tweets)
- UBY Linked Lexical Resource
- GermanPolarityClues
- HeiST – Heidelberg Sentiment Treebank
- Potsdam Twitter Sentiment Corpus (PotTS)
- Sentiment Lexicon (Univ. Zurich)
- SentimentWortschatz
- SpinningBytes Swiss German Sentiment Corpus
- Official GermEval tools list
- GermEval 2015 data (Lexical Substitution)
- GermEval Task 2017
- GermEval-2018 data
- germeval-rug
- IWG_hatespeech_public
- jpadillamontani/germeval2018
- uhh-lt/GermEval2017-Baseline
- UKP embeddings for GermEval 2017
- CorZu (coreference resolution)
- Discourse Segmenter
- Frame Identification
- PropS-DE (proposition structures)
- Archiv für gesprochenes Deutsch
- BAS resources
- Bochumer Korpus der gesprochenen Sprache im Ruhrgebiet
- Database for Spoken German (IDS Mannheim)
- (D)iscourse (I)nformation (R)adio (N)ews (D)atabase for (L)inguistic Analysis
- Hamburger Zentrum für Sprachkorpora
- kaldi-tuda-de
- bubenhofer.com/korpuslinguistik/kurs/
- CorpusExplorer v2.0 – Seminartauglich in einem halben Tag
- deeplearning4nlp-tutorial
- Uni Zürich: Sprachtechnologie in den Digital Humanities – MOOC Youtube & Coursera
- CLARIN VLO (DE+public)
- computerlinguistik.org
- LRE Map
- MetaShare Language Resources
- Peter Kolb's list
- Swiss German Language Processing
- GitHub topics corpus-linguistics & nlp
- nlp-datasets
- NLP-progress
- /r/LanguageTechnology/