Code for Thesis

Primary language: R. License: Creative Commons Zero v1.0 Universal (CC0-1.0)

Replication Data for Thesis "Measuring Polarization in Parliamentary Debates"

Data

Raw Data

The main data source is the database of the Open Discourse Project. The organization provides a database of all parliamentary debates from the German Bundestag between 1949 and 2020. They retrieve their data directly from the German Bundestag Open Data Portal, which provides the transcripts in PDF and XML format. The parsed data can be freely downloaded from Harvard Dataverse. For this study, only the speeches.csv dataset was used, which holds a total of 899,526 speeches together with some metadata, such as speaker, date, session and a direct link to the PDF protocol on the Bundestag website.

Citation: Richter, F.; Koch, P.; Franke, O.; Kraus, J.; Kuruc, F.; Thiem, A.; Högerl, J.; Heine, S.; Schöps, K., 2020, "Open Discourse", https://doi.org/10.7910/DVN/FIKIBO, Harvard Dataverse
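As a rough illustration of working with the speeches table, the following base R sketch loads a CSV file and keeps the speeches from one calendar year. The helper name `load_speeches_for_year` and the column names `id`, `date` and `speech_content` are assumptions made for this example and should be checked against the actual header of speeches.csv. A small temporary file stands in for data/speeches.csv so the snippet runs on its own.

```r
# Sketch only: the column names below are assumptions, not confirmed by the dataset.
# Hypothetical helper: read a speeches CSV and keep rows from a single year.
load_speeches_for_year <- function(path, year) {
  speeches <- read.csv(path, stringsAsFactors = FALSE, encoding = "UTF-8")
  # Parse ISO dates (YYYY-MM-DD); values in another format become NA.
  speeches$date <- as.Date(speeches$date, format = "%Y-%m-%d")
  in_year <- !is.na(speeches$date) &
    format(speeches$date, "%Y") == as.character(year)
  speeches[in_year, , drop = FALSE]
}

# Toy stand-in for data/speeches.csv (invented rows, for illustration only).
toy_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    id = 1:3,
    date = c("2019-12-19", "2020-03-25", "2020-11-18"),
    speech_content = c("a", "b", "c")
  ),
  toy_path,
  row.names = FALSE
)

speeches_2020 <- load_speeches_for_year(toy_path, 2020)
print(speeches_2020$id)
```

With the real file, the call would presumably look like `load_speeches_for_year("data/speeches.csv", 2020)`. For a file of this size, a faster reader may be preferable to `read.csv`.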

Link to documentation and information system (DIP)

To filter the speeches for a specific topic (a basic condition for the ideological scaling and ML algorithms), the speech data must also be linked with data from the German Dokumentations- und Informationssystem für Parlamentsmaterialien (DIP), the official documentation and information system for parliamentary materials. There, users can narrow down parliamentary texts using several filters, such as time period, subject area or transaction type.

The filters used for the query were:

  • Format: Aktivität
  • Datum: January 1, 2020 to December 31, 2020
  • Schlagworte: Covid-19
  • Dokumentart: Bundestag-Plenarprotokoll
  • Aktivitätsart: Reden, Wortmeldungen im Plenum & Rede

After all filters were specified, the search results were exported as a Word document and saved as data/Dip-Export.docx (see Figure).

Figure: Exporting DIP

Merging Data

To filter the primary data/speeches.csv dataset based on the selection in data/Dip-Export.docx, both files were parsed by running the following script.

Scripts/Filter_Speeches.R

The script outputs the filtered dataset, which can be used for further analysis.
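The general idea of such a merge can be sketched in base R as follows. This is not the content of Scripts/Filter_Speeches.R, only a minimal illustration under stated assumptions: that the text of the DIP export has already been read into a character vector (reading a .docx file needs an additional package and is not shown), that each entry mentions a plenary protocol number of the form "term/session" such as "19/154", and that the speeches table has the columns `electoral_term` and `session`. The helper names `extract_protocol_ids` and `filter_speeches` and all toy rows are invented for this example.

```r
# Sketch only: helper names, column names and toy rows are assumptions.
# Hypothetical helper: pull "term/session" numbers such as "19/154" out of text.
extract_protocol_ids <- function(lines) {
  pattern <- "\\b[0-9]{1,2}/[0-9]{1,3}\\b"
  hits <- unique(unlist(regmatches(lines, gregexpr(pattern, lines, perl = TRUE))))
  if (length(hits) == 0) {
    return(data.frame(electoral_term = integer(0), session = integer(0)))
  }
  parts <- do.call(rbind, strsplit(hits, "/", fixed = TRUE))
  data.frame(
    electoral_term = as.integer(parts[, 1]),
    session = as.integer(parts[, 2])
  )
}

# Hypothetical helper: keep speeches held in one of the listed sessions
# (inner join on electoral term and session number).
filter_speeches <- function(speeches, protocol_ids) {
  merge(speeches, protocol_ids, by = c("electoral_term", "session"))
}

# Invented lines standing in for the text of data/Dip-Export.docx.
dip_lines <- c(
  "Rede, BT-Plenarprotokoll 19/154, S. 19120",
  "Rede, BT-Plenarprotokoll 19/191, S. 24100",
  "Rede, BT-Plenarprotokoll 19/154, S. 19125"
)

# Invented rows standing in for data/speeches.csv.
speeches <- data.frame(
  id = 1:4,
  electoral_term = c(19L, 19L, 19L, 18L),
  session = c(154L, 155L, 191L, 154L),
  speech_content = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

protocol_ids <- extract_protocol_ids(dip_lines)
filtered <- filter_speeches(speeches, protocol_ids)
print(filtered$id)
```

Matching on session alone keeps every speech from a listed sitting, including speeches on other agenda items. A stricter filter would probably also match on the speaker or the page reference given in the export.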