talk-of-norway

This repository makes available the v1.0.1 release of the Talk of Norway (TON) dataset, a collection of Norwegian parliament speeches from the 1998-1999 to 2015-2016 sessions. Every speech is richly annotated with metadata pulled from different sources, and augmented with sentence, token, lemma, part-of-speech and morphological feature annotations.

This work is inspired by the Talk of Europe CLARIN campus, and aims primarily at facilitating experimentation at the crossroads between quantitative Political Science and Natural Language Processing. The dataset is currently the core object of study of an interdisciplinary project involving the departments of Political Science and Informatics of the University of Oslo.

For more information on the Talk of Norway project and its participants, please see the UiO project pages at https://www.mn.uio.no/ifi/english/research/projects/ton/index.html

Dataset v1.0.1

The data is split into two main parts: the ./data/ton.csv file, which contains the metadata (see Data.md for a description of the available variables) along with the raw text of the speeches, and the ./data/annotations/ folder, which contains the linguistic annotations of the speeches. Each annotation file in this folder is linked to its metadata row in the CSV file by its file name, which is the same as the id variable.
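The link between a metadata row and its annotation file can be sketched in Python as follows. This is a minimal illustration, not part of the repository: the helper names iter_metadata and find_annotation_file are hypothetical, and the CSV delimiter, the UTF-8 encoding and the possibility of a file extension after the id are assumptions that should be checked against the unpacked data.

```python
import csv
from pathlib import Path
from typing import Dict, Iterator, Optional


def iter_metadata(csv_path: Path, delimiter: str = ",") -> Iterator[Dict[str, str]]:
    """Yield one dict per row of the metadata file.

    Assumption: the file is UTF-8 and uses `delimiter` as separator;
    adjust both if the actual ton.csv differs.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle, delimiter=delimiter)


def find_annotation_file(speech_id: str, annotations_dir: Path) -> Optional[Path]:
    """Return the annotation file named after `speech_id`, or None.

    Tries the bare id first, then the id followed by any extension
    (an assumption, since the README only says the names match).
    """
    exact = annotations_dir / speech_id
    if exact.is_file():
        return exact
    matches = sorted(
        path for path in annotations_dir.glob(speech_id + ".*") if path.is_file()
    )
    return matches[0] if matches else None
```

With the archive unpacked, one would iterate over iter_metadata(Path("data/ton.csv")) and pass each row's id value, together with Path("data/annotations"), to find_annotation_file.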

The linguistic annotations themselves loosely follow the CoNLL format, with newline-separated tokens and double newline-separated sentences. Every line contains tab-separated token-level annotations, following this pattern:

index token lemma part-of-speech features

For instance:

1    Ærede                ære                adj      fl|<perf-part>|tr1
2    medrepresentanter    medrepresentant    subst    appell|mask|ub|fl
3    !                    $!                 clb      <<<|<utrop>|<<<

Note that the morphological features in the fifth column are themselves separated with the pipe (|) character.
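As an illustration of the format described above, the following Python sketch reads annotation text into sentences and tokens. It is not part of the repository: the function name parse_annotations and the dictionary keys are hypothetical, and it assumes every token line carries the five tab-separated columns shown in the pattern.

```python
import re
from typing import Dict, Iterator, List

# Column names taken from the pattern above; the dict keys are our own choice.
FIELDS = ("index", "token", "lemma", "pos", "features")


def parse_annotations(text: str) -> Iterator[List[Dict[str, object]]]:
    """Yield one list of token dicts per sentence."""
    # Sentences are separated by a blank line, tokens by a single newline.
    for block in re.split(r"\n\s*\n", text.strip()):
        sentence = []
        for line in block.splitlines():
            if not line.strip():
                continue
            token: Dict[str, object] = dict(zip(FIELDS, line.split("\t")))
            # Morphological features are pipe-separated within their column.
            features = str(token.get("features", ""))
            token["features"] = features.split("|") if features else []
            sentence.append(token)
        if sentence:
            yield sentence


# The example from this README, with explicit tab characters.
sample = (
    "1\tÆrede\tære\tadj\tfl|<perf-part>|tr1\n"
    "2\tmedrepresentanter\tmedrepresentant\tsubst\tappell|mask|ub|fl\n"
    "3\t!\t$!\tclb\t<<<|<utrop>|<<<\n"
)
sentences = list(parse_annotations(sample))
```

Each sentence is a list of dictionaries, so sentences[0][1]["lemma"] gives the lemma of the second token in the first sentence.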

Sources

Linguistic annotations are automatically obtained using langid.py for language identification and the Oslo-Bergen tagger for morphological analysis as implemented in the Language Analysis Portal (LAP).

Metadata was pulled from several sources, using a dump of the holder-de-ord database as a starting point, adding further information from the Storting API, scraping the Storting web pages, and integrating data from Søyland (forthcoming). See Data.md for more information on the variables.

Get the data

You can download the data from http://ltr.uio.no/ton/ton.data.101.tgz. The recommended way to stay up to date with this repository is to clone it and unpack the downloaded archive into its top-level directory.

On most UNIX systems, you can type the following in your terminal:

git clone https://github.com/ltgoslo/talk-of-norway
cd talk-of-norway
wget http://ltr.uio.no/ton/ton.data.101.tgz
tar -xzf ton.data.101.tgz
rm ton.data.101.tgz

How to cite

Publications connected to this dataset are forthcoming. For the time being, please use the following BibTeX entry to cite this work:

@online{Lap:Soy:16,
  author = {Lapponi, Emanuele and S{\o}yland, Martin G.},
  title = {Talk of Norway},
  year = 2016,
  url = {https://github.com/ltgoslo/talk-of-norway},
  urldate = {2016-10-29}
}

License