This repository contains the materials for workshops held at the MZES Social Science Data Lab and at the Connected_Politics Lab. It includes the slides (workshop_slides.pdf), an example text data set (multilingual_data_annotated_translated.csv), which is a subset of the REMINDER media corpus, and an R Markdown script (multilingual_lemmatizing_translation.Rmd) that performs lemmatization and machine translation on a multilingual corpus.

Workshop Outline

Automated text analysis methods have become popular in computational social science. Their appeal lies in the promise of automatically extracting meaning from large numbers of documents, which allows researchers to better understand the contents and, indirectly, the document creators and audiences. While the existing techniques are well established for English-language text, the situation is different for the study of text in more than one language and in languages other than English. Yet it is precisely these multilingual techniques that are needed for (country) comparative research designs.

This workshop motivates the need for comparative social science studies that base their interpretations on text data. The main part provides guidance and practical tips for planning such research designs. In particular, it covers the definition of comparative research goals, the selection of a case-comparative text data set, the definition of concepts, and the creation of a human-annotated validation baseline. The workshop then focuses on methodological strategies for obtaining measurements from a multilingual corpus with automated text analysis methods. All steps are illustrated with an applied example.

Text Classification for Comparative Research

Multilingual dictionaries and materials to implement supervised text classification for multilingual corpora can be found here.

Topic Modeling for Comparative Research

Instructions that facilitate the implementation of PLTM (Mimno et al., 2009) are collected in this repository.

More resources for multilingual text analysis

A tutorial on lemmatization with udpipe (Wijffels, 2021).

A tutorial on machine translation with deeplr (Zumbach & Bauer, 2021).

OPTED Living hub, a knowledge base for multilingual computational text analysis.

Further readings

Esser, F., & Vliegenthart, R. (2017). Comparative research methods. The International Encyclopedia of Communication Research Methods. Link.

Lind, F. (2021). Multilingual Automated Content Analysis for Comparative Communication Research (Doctoral dissertation, University of Vienna). Happy to send you a copy.

Livingstone, S. (2003). On the challenges of cross-national comparative media research. European Journal of Communication, 18(4), 477–500. Link.

Lucas, C., Nielsen, R. A., Roberts, M. E., Stewart, B. M., Storer, A., & Tingley, D. (2015). Computer-assisted text analysis for comparative politics. Political Analysis, 23(2), 254–277. Link.