A curated list of awesome resources for Danish language technology
- Danish Gigaword - Collection of Danish corpora (as of May 2020 the corpus is not openly available).
- OSCAR - Danish corpus derived from the Common Crawl corpus. Described in Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures (Scholia)
- CLARIN-DK-UCPH
- The Danish Parliament Corpus 2009 - 2017, v1. The license is Creative Commons - Attribution 4.0 International
- Grundtvig's Works Corpus. Not for commercial use as the license is Creative Commons - Attribution-NonCommercial 4.0 International.
- DK-CLARIN Reference Corpus of General Danish. Only for academic use.
- SemDaX - POS-tagged (only adjectives, nouns and verbs), super sense tagged and BIO-tagged sentences. For educational, teaching or research purposes only.
- NOMCO - "an annotated multimodal collection of conversational Danish". Apparently not directly available for download. (Scholia)
- Danish Propbank - Commercial resource with 87,000 tokens annotated with morphosyntactic information, VerbNet classes and semantic roles.
- Danish Dependency Treebank v. 1.0 - Matthias Trautner Kromann et al.'s dependency annotation of some texts from PAROLE.
- Mr. Bean corpus - Small Danish-Italian corpus with written and spoken retellings (of Mr Bean episodes) and argumentative text (about smoking). Possibly described in Tekststrukturering på italiensk og dansk.
- Køge Corpus - Danish-Turkish transcribed corpus by Jens Normann Jørgensen.
- Danske taler - Collection of Danish speeches. API available at https://dansketaler.dk/wp-json/wp/v2/tale
- DKhate - Corpus of 3600 hate speech samples from Twitter and Reddit as well as news comments (to appear in 2020).
- DaNewsroom - Danish summarization dataset. Probably to appear in 2020. Described in DaNewsroom: A Large-scale Danish Summarisation Dataset (Scholia)
- Wikipedia
- wiki40b/da - Cleaned-up text from Danish Wikipedia. Described in Wiki-40B: Multilingual Language Model Dataset (Scholia).
- XED - emotion annotated movie subtitles. Described in XED: A Multilingual Dataset for Sentiment Analysis and Emotion Detection (Scholia).
- DaN+ - Nested named entity annotations on top of the entire Danish Universal Dependencies treebank (UD_Danish-DDT) and 3 new web domains, including lexical normalization. Described in DaN+: Danish Nested Named Entities and Lexical Normalization.
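
The Danske taler endpoint listed above has the shape of a WordPress REST API route (`/wp-json/wp/v2/...`). The following is a minimal sketch for paging through it, assuming the endpoint honours the standard WordPress collection parameters `page` and `per_page`; this has not been verified against the live service, and the helper names `build_page_url` and `fetch_page` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint exactly as given in the list above.
BASE_URL = "https://dansketaler.dk/wp-json/wp/v2/tale"


def build_page_url(page=1, per_page=10):
    """Return the URL for one page of results.

    Assumption: the endpoint accepts the standard WordPress REST
    collection parameters `page` and `per_page`.
    """
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    return BASE_URL + "?" + urlencode({"page": page, "per_page": per_page})


def fetch_page(page=1, per_page=10, timeout=30):
    """Download one page and decode the JSON body (needs network access)."""
    with urlopen(build_page_url(page, per_page), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_page(page=1)` should return the decoded JSON for the first page if the assumptions hold. Check the terms of use of the site before collecting speeches in bulk.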
- Europarl, parallel sentences between Danish and English from the European Parliament.
- JW300 - "a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average"
- OpenSubtitles2018 - Parallel corpus from movie and tv subtitles. Described in OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles.
- Tatoeba - Sentences
- WikiMatrix, parallel sentences from Wikipedias. 1620 language pairs, including Danish
- DanPASS - Described in DanPASS - A Danish Phonetically Annotated Spontaneous Speech corpus (Scholia)
- DK-Parole
- LANCHART
- Common Voice - Crowdsourced multilingual voice dataset. As of 18 December 2019 there is no Danish data. Described in Common Voice: A Massively-Multilingual Speech Corpus (Scholia)
- NST
- NST-speech-22khz - A 22kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is dictation.
- NST-speech-16kHz - A 16kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is read-aloud and the text is phonetically balanced. Designed for ASR training and testing.
- NST-speech-44kHz - A 44kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. Designed for speech synthesis.
- NST-lexical-database - A pronunciation dictionary compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service.
- DanNet - Danish wordnet (v. 2.2) in OWL format with a three-clause BSD-like license.
- Retskrivningsordbogen. The official Danish spelling dictionary digitally available under its own special license.
- Opslagsord og ordklasser in CSV format.
- Lexemes, word classes and inflections. Excerpt in the CSF format available. Full list presumably available upon request.
- Lexemes, word classes, inflections, grammatical information, hyphenation and usage examples in XML. Full list presumably available upon request.
- Stavekontrolden - Word list with 160132 Danish words. Used, e.g., for spelling suggestions in LibreOffice. Licensed under GPL, LGPL, and MPL.
- The Concise Danish Dictionary/The Comprehensive Danish Dictionary/Den Store Danske Ordliste (DSDO), word list created by Skåne Sjælland Linux User Group and distributed under a GPL license
- In Debian-based distributions the word list may be installed with `sudo aptitude install aspell-da` and extracted with `aspell -d da dump master`.
- Interactive Terminology for Europe (IATE) - European Union terminology database. October 2020 version contains over 500,000 Danish terms.
- The Danish FrameNet Lexicon, a 40,267-line resource containing 5,300 verbs and 6,490 verbal nouns.
- Wikidata lexemes - Structured database with metadata about lexemes, their forms and their senses. Over 300,000 lexemes including over 6,700 Danish lexemes in October 2020.
- Overview over Danish lexemes in Ordia - webapp with overview of content of Wikidata lexemes based on SPARQL queries.
- Wikidata lexemes latest lexemes dump in ttl - official dump of lexeme-only part of Wikidata.
- NST-ngrams - An n-gram frequency list compiled by Nordisk Språkteknologi from newspaper text and made available by the Norwegian Library Service. Can be compiled to an n-gram LM with SRILM.
- AFINN - Danish lexicons annotated for sentiment.
- concreteness-estimates-da - Bill D. Thompson's concreteness estimates for Danish words, as detailed in Automatic Estimation of Lexical Concreteness in 77 Languages (Scholia).
- Danish Swadesh List - List of Danish words of basic concepts from The Rosetta Project.
- Sketch Engine - Cloud service with word lists, thesaurus, collocations, n-grams etc. Free for academic use in the European Union and a paid service for commercial use.
- Danish-Similarity-Dataset - Similarity scores for 99 Danish word pairs by Nina Schneidermann and Bolette Sandford Pedersen. Also available in danlp.
- Wordsim353-da - Danish translation by Finn Årup Nielsen of the English Wordsim353 word pair set. Also available in danlp.
- Four words - 100 odd-one-out sets of 4 words or phrases.
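
The Wikidata lexeme entries above are reachable through SPARQL, which is also what the Ordia overview is described as using. The following is a minimal sketch of building such a query for Danish lexemes, assuming the public Wikidata Query Service endpoint and its predefined prefixes (`wd:`, `dct:`, `wikibase:`). `Q9035` is the Wikidata item for the Danish language; the function names are hypothetical, and this is not the query Ordia itself runs.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Q9035 is the Wikidata item for the Danish language.
DANISH_QID = "Q9035"


def danish_lexeme_query(limit=10):
    """Build a SPARQL query listing lexemes whose language is Danish."""
    if limit < 1:
        raise ValueError("limit must be positive")
    return (
        "SELECT ?lexeme ?lemma WHERE {\n"
        f"  ?lexeme dct:language wd:{DANISH_QID} ;\n"
        "          wikibase:lemma ?lemma .\n"
        "}\n"
        f"LIMIT {limit}"
    )


def query_url(query):
    """Return a GET URL that asks the endpoint for JSON results."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
```

The URL from `query_url(danish_lexeme_query(5))` can be fetched with any HTTP client. For anything beyond small samples, the lexeme dump listed above is probably the better starting point.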
- cc.da.300 (bin file GB large) - fastText-trained embedding on Danish part of Common Crawl and Danish Wikipedia. Read more about the method in Learning Word Vectors for 157 Languages (Scholia).
- wiki.da (bin+text file) - fastText-trained embedding on Danish Wikipedia. Read more about the method in Enriching Word Vectors with Subword Information (Scholia).
- Byte-Pair Encoding embedding - Gensim-based subword embedding. A large number of Danish embeddings are available. They differ in the size of the vocabulary (from 1000 to 200000) and subspace dimensions (from 25 to 300).
- NLPL word embeddings repository - NLPL word embeddings repository by Language Technology Group at the University of Oslo. Two Danish embedding models as of November 2020.
- Danish NLPL word embedding - 100-dimensional word2vec skipgram model trained by Andrey Kutuzov based on the Danish CoNLL17 corpus.
- Danish DSL and Reddit word2vec word embeddings - 300-dimensional CBOW word2vec word embedding by Emil Middelboe and Anders Lillie trained on the Danish DSL corpus and Reddit.
- Danish BERT - Botxo/Møllerhøj's weights for a BERT model trained on a large Danish corpus.
- Danish ELECTRA - Philip Tamimi-Sarnikowski's Danish ELECTRA model. Available in the transformers library.
- ConvBERT - Philip Tamimi-Sarnikowski's model
- Danish ELMo on OSCAR - (Link does not work as of December 2020)
- Ælæctra - Malte Højmark-Bertelsen's Danish Gigaword-trained Electra-based model
- Multilingual sentence transformers - Pre-trained multilingual sentence transformers.
- wiki40b-lm-da - language model trained on Danish from Wiki40B dataset
- Lemmy - Lemmatizer for Danish in Python.
- cstlemma - lemmatiser.
- spaCy - Python-based package with lemmatization.
- spaCy - Python-based named entity extraction
- daner - Named entity extraction.
- flair+danlp ner-tagger - Flair NER tagger trained by the Alexandra Institute.
- Polyglot named entity extraction
- DBpedia Spotlight - DBpedia-based entity linker. Described in Improving Efficiency and Accuracy in Multilingual Entity Extraction (Scholia)
- afinn - Python package with AFINN Danish lexicon annotated for sentiment, also installable with `pip install afinn`.
- Sentida - R package with Danish sentiment lexicon and handling of, e.g., negation. Detailed in SENTIDA: A New Tool for Sentiment Analysis in Danish (Scholia).
- Hisia - Python package with pre-trained machine-learning based Danish sentiment analysis by Prayson Wilfred Daniel.
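
Lexicon-based tools such as the ones above rely on a word list where each entry carries a sentiment score. The following is a minimal sketch of that idea: sum the scores of the words found in a text. The lexicon entries and their scores here are hypothetical and are not taken from AFINN or Sentida. The function names are made up for illustration, and the sketch does not handle negation.

```python
import re

# Toy lexicon for illustration only: these entries and scores are
# hypothetical and are NOT taken from the AFINN or Sentida word lists.
TOY_LEXICON = {"god": 2, "dejlig": 3, "dårlig": -2, "forfærdelig": -3}

# \w matches Unicode word characters, so æ, ø and å are kept inside tokens.
WORD_PATTERN = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return WORD_PATTERN.findall(text.lower())


def score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of all tokens; unknown words count as 0."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))
```

With the toy lexicon, `score("Dejlig dag")` gives 3, while a sentence mixing "god" and "forfærdelig" nets out to -1. A real lexicon also has to deal with inflected forms, which this exact-match lookup ignores.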
- danspeech - DeepSpeech2-based Danish speech recognition in Python
- kaldi-sprakbanken - A recipe for training a state-of-the-art (2017) speech recogniser for Danish based on the 16kHz NST database.
- espeak - An open-source speech synthesis program for ~56 languages including Danish. eSpeak can also be used as a grapheme-to-phoneme converter and was used to create the Danish Kaldi recipe.
- ResponsiveVoice - Commercial Web-based (Javascript-based) text-to-speech synthesis for a number of languages, including Danish. The commercial service is currently free for limited and non-commercial use.
- Google Cloud Text-to-Speech - Commercial Web-based text-to-speech synthesis for a number of languages, including Danish.
- Amazon Polly - Commercial Web-based text-to-speech synthesis for a number of languages, including Danish. Part of Amazon's commercial AWS services. Female and male voices are available as examples. Limited unregistered free service available at TTSMP3.
- DaNLP - "a repository for Natural Language Processing resources for the Danish Language."
- dapipe - Danish UD-pipe: tokenisation, lemmatisation, PoS tagging, morphology, dependencies.
- UDPipe - Non-language specific version of dapipe. A newer version of the Danish-DDT model than the one offered by dapipe is available at https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2998
- DKIE - GATE pipeline including wrapped Danish models for Stanford CoreNLP.
- StanfordNLP. Python software package for dependency parsing, including tokenization, lemmatization and part-of-speech tagging. A pre-trained model for Danish is available.
- bornholmsk - Datasets and embeddings for the Bornholmsk dialect.
- spaCy - Python-based natural language processing package
- dacy - Danish spaCy pipeline.
- ELEXIS Monolingual Word Sense Alignment Task - Predicting the relationship between two senses in each of several languages, including Danish.
- OffensEval 2020 - Danish - Offensive Language Identification in Social Media competition. Described in Offensive Language and Hate Speech Detection for Danish (Scholia)
- Danish resources - Finn Årup Nielsen's PDF with pointers to Danish resources.
- Scholia's topic aspect for Danish, works (mostly scientific articles) about "Danish" as listed in Wikidata.
- DaNLP - Alexandra Institute's list of Danish resources
- Language Technology Resources for Danish, list from Det Danske Sprog- og Litteraturselskab
- European Language Resources Association (ELRA) list for Danish, list of various annotated corpora available for purchase with both commercial and non-commercial licenses.
- sprogteknologi.dk - List of Danish language resources. Compiled by the Agency for Digitisation.