Date: 20.07.2020 Author: Andrés Ponce
- Introduction
- Scraping the Presidential Press Website
- Analysing text & structural topic modelling
- Final Thoughts
As a Public Policy graduate, I see great challenges in open government policies and, in particular, in access to public data. I started this project with the idea of using coding skills to gather and analyze public sources available to any citizen. As of January 2020, the presidential press website of Chile, prensa.presidencia, contained a large collection of official speeches by President Piñera, from the time of his election to January 2020. These releases reflect the president's communication strategy, even if they are subject to editorial control by presidential staff.
This project, run entirely in R, consists of two parts: first, the scraping strategy, which gathers speeches from March 2018 to January 2020; and second, structural topic modelling using date as a covariate to understand how topic proportions change over time.
The scraping process takes advantage of the URL structure of https://prensa.presidencia.cl/discursos.aspx by using the Rcrawler [1] and rvest [2] libraries. The speeches tab, "discursos", contained 97 pages at the time of scraping. Each of these pages links to at most 6 speeches, for a total of 582 separate pages, each containing one speech. Every speech page has a URL made of a fixed pattern followed by a number, for example https://prensa.presidencia.cl/discurso.aspx?id=135058. This pattern is used to identify speech pages in the Rcrawler output.
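The pattern-matching step can be sketched in base R as follows. This is a minimal illustration, not the original code: `crawled_urls` is a hypothetical stand-in for the URLs collected by the crawler, and every URL in it other than the two quoted above is made up for the example.

```r
# Hypothetical stand-in for the URLs collected by the crawler.
crawled_urls <- c(
  "https://prensa.presidencia.cl/discursos.aspx",
  "https://prensa.presidencia.cl/discurso.aspx?id=135058",
  "https://prensa.presidencia.cl/discurso.aspx?id=135058",  # duplicate hit
  "https://prensa.presidencia.cl/discurso.aspx?id=135059",  # hypothetical id
  "https://prensa.presidencia.cl/comunicados.aspx"          # hypothetical page
)

# A speech page is the fixed prefix followed by a numeric id.
speech_pattern <- "^https://prensa\\.presidencia\\.cl/discurso\\.aspx\\?id=[0-9]+$"

# Keep only speech pages and drop duplicates.
speech_urls <- unique(crawled_urls[grepl(speech_pattern, crawled_urls)])

# Extract the numeric id from each speech URL.
speech_ids <- as.integer(sub("^.*id=", "", speech_urls))

print(speech_urls)
print(speech_ids)
```

Anchoring the regular expression at both ends keeps the index page discursos.aspx from being mistaken for a speech page.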
Scraping the text is straightforward with rvest. I created a function that chains three calls: read_html(), html_nodes(), and html_text(). I used the same process to retrieve other useful information, such as the date and title of each speech.
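A function of this kind might look like the sketch below. The CSS selectors (`h1`, `.fecha`, `.texto p`) and the sample HTML are assumptions made for illustration. The real selectors depend on the markup of the press website, which is not reproduced here.

```r
library(rvest)

# Sketch of a scraping function. The default selectors are hypothetical.
scrape_speech <- function(page,
                          title_css = "h1",
                          date_css  = ".fecha",
                          text_css  = ".texto p") {
  # `page` may be a URL, a file path, or a literal HTML string.
  doc <- read_html(page)

  # Helper: select nodes by CSS and return their trimmed text.
  pick <- function(css) html_text(html_nodes(doc, css), trim = TRUE)

  data.frame(
    title = paste(pick(title_css), collapse = " "),
    date  = paste(pick(date_css), collapse = " "),
    text  = paste(pick(text_css), collapse = "\n"),
    stringsAsFactors = FALSE
  )
}

# Offline demonstration on a made-up page, so no network access is needed.
sample_page <- paste0(
  "<html><body><h1>Discurso de ejemplo</h1>",
  "<span class='fecha'>11 MAR 2018</span>",
  "<div class='texto'><p>Texto de ejemplo uno.</p>",
  "<p>Texto de ejemplo dos.</p></div></body></html>"
)
speech <- scrape_speech(sample_page)
print(speech)
```

Applied over the vector of speech URLs, for example with lapply() followed by do.call(rbind, ...), this would produce one row per speech.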
The 582 scraped speeches are shown by monthly count as follows:
Monthly count of presidential speeches since Piñera took office in March 2018
Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec | Total |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2018 | | | 18 | 30 | 27 | 30 | 27 | 39 | 27 | 50 | 25 | 29 | 302 |
2019 | 37 | 22 | 18 | 20 | 29 | 22 | 27 | 25 | 25 | 23 | 11 | 12 | 271 |
2020 | 9 | | | | | | | | | | | | 9 |
Total | 46 | 22 | 36 | 50 | 56 | 52 | 54 | 64 | 52 | 73 | 36 | 41 | 582 |
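A year-by-month count like the one above can be produced in base R. The sketch below uses a small, made-up vector of dates (`speech_dates`) rather than the scraped data, so its numbers do not match the table.

```r
# Hypothetical speech dates standing in for the scraped ones.
speech_dates <- as.Date(c(
  "2018-03-11", "2018-03-20", "2018-04-02",
  "2019-01-15", "2019-11-05", "2020-01-09"
))

# Year as a label, month as a factor so that empty months still appear.
year  <- format(speech_dates, "%Y")
month <- factor(as.integer(format(speech_dates, "%m")),
                levels = 1:12, labels = month.abb)

# Cross-tabulate and add row and column totals (labelled "Sum").
monthly_counts <- addmargins(table(Year = year, Month = month))
print(monthly_counts)
```

Using a factor with all twelve levels keeps the columns aligned across years, including months with no speeches.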
Plotting the number of public speeches given by the Chilean president shows a substantial decrease from November 2019 onward. This coincides with the social uprising that Chile experienced at the time, which drove the government to an all-time minimum in public support according to CADEM [3].
In the second step I apply structural topic modelling with the stm package. The stm package allows the researcher to estimate a topic model that includes document-level covariates. In this case I used the date to see how the proportion of each topic varies across time (in months). I chose Plotly (hosted in Plotly studio) to visualize topic trends over time. For instance, Security & Crime is a recurrent topic in the president's speeches. The proportion of this topic spikes at the end of 2018 and of 2019, coinciding with the police brutality scandals the government faced: first over the killing of an unarmed indigenous civilian, and then over police repression during the social upheaval.
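The modelling step can be sketched with stm as follows. This is an illustration under stated assumptions and not the model estimated for this project. The corpus is simulated, the names `month_index`, `security` and `economy` are hypothetical, the number of topics (K = 2) is chosen only to keep the example small, and the date enters as a simple linear term. textProcessor() additionally requires the tm package to be installed.

```r
library(stm)

set.seed(2020)

# Simulated toy corpus: 5 short "speeches" per month over 24 months.
security <- c("police", "crime", "security", "violence", "justice", "prison")
economy  <- c("growth", "jobs", "investment", "economy", "taxes", "pensions")
month_index <- rep(1:24, each = 5)

# The expected share of security words grows with the month index.
texts <- vapply(month_index, function(m) {
  k <- rbinom(1, 40, m / 25)
  words <- c(sample(security, k, replace = TRUE),
             sample(economy, 40 - k, replace = TRUE))
  paste(words, collapse = " ")
}, character(1))

# Convert raw text to stm's document format, keeping the date covariate.
processed <- textProcessor(texts,
                           metadata = data.frame(month_index = month_index),
                           stem = FALSE, verbose = FALSE)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta,
                     verbose = FALSE)

# Topic prevalence is allowed to vary with the month of the speech.
fit <- stm(documents = out$documents, vocab = out$vocab, K = 2,
           prevalence = ~ month_index, data = out$meta,
           init.type = "LDA", seed = 2020, max.em.its = 25, verbose = FALSE)

# Regress topic proportions on the date covariate.
effect <- estimateEffect(1:2 ~ month_index, fit, metadata = out$meta)
print(summary(effect))

# Mean topic proportion per month: the kind of series one would plot.
monthly_theta <- aggregate(fit$theta,
                           by = list(month = out$meta$month_index),
                           FUN = mean)
print(head(monthly_theta))
```

For real speeches in Spanish, the language argument of textProcessor() would need to be set accordingly. A smoother time trend could be obtained by wrapping the covariate in stm's s() spline helper.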
This exercise had no other purpose than to train coding skills and apply empirical methods to text data and, more specifically, to data that should be available to all citizens. However, it is important to point out that, by the end of this project, President Piñera's speeches were no longer available on the press website. Only speeches from the current month can be accessed, with no option to browse past speeches.