Awesome Danish

A curated list of awesome resources for Danish language technology

Data

Corpora

Parallel corpora

Spoken language corpora

  • DanPASS - Described in DanPASS - A Danish Phonetically Annotated Spontaneous Speech corpus (Scholia)
  • DK-Parole
  • LANCHART
  • Common Voice - Crowdsourced multilingual voice dataset. As of 18 December 2019 there is no Danish data. Described in Common Voice: A Massively-Multilingual Speech Corpus (Scholia)
  • NST
    • NST-speech-22khz - A 22kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is dictation.
    • NST-speech-16kHz - A 16kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is read-aloud and the text is phonetically balanced. Designed for ASR training and testing.
    • NST-speech-44kHz - A 44kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. Designed for speech synthesis.

Dictionaries and ontologies

  • NST-lexical-database - A pronunciation dictionary compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service.
  • DanNet - Danish wordnet (v 2.2) in OWL format, distributed under a three-clause BSD-like license.
  • Retskrivningsordbogen - The official Danish spelling dictionary, digitally available under its own special license.
    • Opslagsord og ordklasser (headwords and word classes) in CSV format.
    • Lexemes, word classes and inflections. Excerpt in the CSF format available. Full list presumably available upon request.
    • Lexemes, word classes, inflections, grammatical information, hyphenation and usage examples in XML. Full list presumably available upon request.
  • Stavekontrolden - word list with 160132 Danish words. Used, e.g., for spelling suggestions in LibreOffice. Licensed under GPL, LGPL, and MPL.
  • The Concise Danish Dictionary/The Comprehensive Danish Dictionary/Den Store Danske Ordliste (DSDO), word list created by Skåne Sjælland Linux User Group and distributed under a GPL license
    • In Debian-based distributions the word list may be installed with sudo aptitude install aspell-da and extracted with aspell -d da dump master.
  • Interactive Terminology for Europe (IATE) - European Union terminology database. October 2020 version contains over 500,000 Danish terms.
  • The Danish FrameNet Lexicon - a 40,267-line resource containing 5,300 verbs and 6,490 verbal nouns.
  • Wikidata lexemes - structured database with metadata about lexemes, their forms and their senses. Over 300,000 lexemes, including over 6,700 Danish lexemes, in October 2020.
  • NST-ngrams - An n-gram frequency list compiled by Nordisk Språkteknologi from newspaper text and made available by the Norwegian Library Service. Can be compiled to an n-gram LM with SRILM.
  • AFINN - Danish lexicons annotated for sentiment.
  • concreteness-estimates-da - Bill D. Thompson's concreteness estimates for Danish words, as detailed in Automatic Estimation of Lexical Concreteness in 77 Languages (Scholia).
  • Danish Swadesh List - List of Danish words of basic concepts from The Rosetta Project.
  • Sketch Engine - cloud service with word lists, thesaurus, collocations, n-grams, etc. Free for academic use in the European Union; paid service for commercial use.
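
The aspell word list mentioned above can also be read from Python. What follows is a minimal sketch, assuming the aspell binary and the aspell-da dictionary are installed. The helper names clean_word_list and dump_aspell_words are hypothetical, and stripping anything after a slash is only a defensive guess in case the dump contains affix flags.

```python
import subprocess


def clean_word_list(lines):
    """Normalise raw word-list lines into a sorted list of unique words."""
    words = set()
    for line in lines:
        # Drop any affix flags after a slash (e.g. "hus/AB"), if present.
        word = line.split("/", 1)[0].strip()
        if word:
            words.add(word)
    return sorted(words)


def dump_aspell_words(lang="da"):
    """Run `aspell -d <lang> dump master` and return the cleaned word list.

    Assumes aspell and the matching dictionary are installed; raises
    FileNotFoundError or CalledProcessError otherwise.
    """
    result = subprocess.run(
        ["aspell", "-d", lang, "--encoding=utf-8", "dump", "master"],
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return clean_word_list(result.stdout.splitlines())
```

Calling dump_aspell_words() returns the Danish words as a Python list. Keeping clean_word_list separate makes it possible to reuse the same normalisation on word lists read from a file.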

Word sets

  • Danish-Similarity-Dataset - Similarity scores for 99 Danish word pairs by Nina Schneidermann and Bolette Sandford Pedersen. Also available in danlp.
  • Wordsim353-da - Danish translation by Finn Årup Nielsen of the English Wordsim353 word pair set. Also available in danlp.
  • Four words - 100 odd-one-out sets of 4 words or phrases.
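
Word-pair similarity sets such as the ones above are commonly used to evaluate embeddings by correlating human scores with model similarities. The following is a minimal standard-library sketch of that idea using Spearman rank correlation over cosine similarities. The function names, the toy vectors and the toy scores are all made up for illustration and are not taken from any of the listed datasets.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(pairs, vectors):
    """Correlate human scores with cosine similarities.

    pairs is a list of (word1, word2, human_score); pairs with an
    out-of-vocabulary word are skipped. Returns (rho, pairs_used).
    """
    human, model = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            human.append(score)
            model.append(cosine(vectors[w1], vectors[w2]))
    return spearman(human, model), len(human)


# Made-up toy data, for illustration only.
toy_vectors = {
    "hund": [1.0, 0.1],
    "kat": [0.9, 0.3],
    "bil": [0.1, 1.0],
    "tog": [0.4, 0.9],
}
toy_pairs = [
    ("hund", "kat", 8.0),
    ("bil", "tog", 7.0),
    ("hund", "bil", 2.0),
    ("kat", "tog", 3.0),
    ("hund", "fly", 1.0),  # "fly" is out of vocabulary and is skipped
]
rho, used = evaluate(toy_pairs, toy_vectors)
print(rho, used)
```

Reporting the number of pairs actually used alongside the correlation matters in practice, since vocabulary coverage differs between embeddings.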

Embeddings

Neural models

Tools

Lemmatization

  • Lemmy - Lemmatizer for Danish in Python.
  • cstlemma - lemmatiser.
  • spaCy - Python-based package with lemmatization.

Named entity recognition

Entity linking

  • DBpedia Spotlight - DBpedia-based entity linker. Described in Improving Efficiency and Accuracy in Multilingual Entity Extraction (Scholia)

Sentiment analysis

  • afinn - Python package with AFINN Danish lexicon annotated for sentiment, also installable with pip install afinn.
  • Sentida - R package with a Danish sentiment lexicon and handling of, e.g., negation. Detailed in SENTIDA: A New Tool for Sentiment Analysis in Danish (Scholia).
  • Hisia - Python package with pre-trained machine-learning based Danish sentiment analysis by Prayson Wilfred Daniel.
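
Lexicon-based tools such as afinn score a text by summing the valences of the words it contains. The sketch below illustrates that principle with the standard library only. The toy lexicon and its valences are made up and are not taken from AFINN, and the function name score is chosen for illustration. To the best of our knowledge, the afinn package itself exposes comparable functionality as Afinn(language='da').score(text); check the package documentation before relying on that.

```python
import re

# Made-up toy lexicon for illustration; the valences are NOT taken from AFINN.
TOY_LEXICON = {"god": 3, "dejlig": 3, "dårlig": -3, "forfærdelig": -3}

# \w matches Unicode letters in Python 3, so æ, ø and å are kept inside tokens.
WORD_PATTERN = re.compile(r"\w+")


def score(text, lexicon=TOY_LEXICON):
    """Sum the valences of all lexicon words found in the text."""
    tokens = WORD_PATTERN.findall(text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


print(score("Maden var god, men servicen var dårlig og forfærdelig."))
```

This simple summation ignores negation and intensifiers, which is the kind of handling that tools like Sentida add on top of a lexicon.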

Automatic Speech Recognition

  • danspeech - DeepSpeech2-based Danish speech recognition in Python
  • kaldi-sprakbanken - A recipe for training a state-of-the-art (as of 2017) speech recogniser for Danish based on the 16kHz NST database.

Speech Synthesis (text-to-speech)

  • espeak - An open-source speech synthesis program for ~56 languages including Danish. eSpeak can also be used as a grapheme-to-phoneme converter and was used to create the Danish Kaldi recipe.
  • ResponsiveVoice - Commercial Web-based (JavaScript-based) text-to-speech synthesis for a number of languages, including Danish. The service is currently free for limited, non-commercial use.
  • Google Cloud Text-to-Speech - Commercial Web-based text-to-speech synthesis for a number of languages, including Danish.
  • Amazon Polly - Commercial Web-based text-to-speech synthesis for a number of languages, including Danish. Part of Amazon's commercial AWS services. Female and male voices are available as examples. Limited unregistered free service available at TTSMP3.

Fundamental processing

  • DaNLP - "a repository for Natural Language Processing resources for the Danish Language."
  • dapipe - Danish UD-pipe: tokenisation, lemmatisation, PoS tagging, morphology, dependencies.
  • UDPipe - Non-language-specific version of dapipe. A newer version of the Danish-DDT model than the one offered by dapipe is available at https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2998
  • DKIE - GATE pipeline including wrapped Danish models for Stanford CoreNLP.
  • StanfordNLP - Python software package for dependency parsing, including tokenization, lemmatization and part-of-speech tagging. A pre-trained model for Danish is available.
  • bornholmsk - Datasets and embeddings for the Bornholmsk dialect.
  • spaCy - Python-based natural language processing package
  • dacy - Danish spaCy pipeline.

Competitions

Resources about resources