This repository contains the code to geocode polling stations in Brazil. We leverage administrative datasets to geocode all polling stations used in elections from 2006 to 2022.
We detail our methodology and its limitations in this document. As we explain there, our method often performs better than commercial solutions like the Google Maps Geocoding Service, particularly in rural areas. Despite our best efforts, however, this procedure will inevitably make mistakes, and consequently some coordinates will be incorrect.
The latest dataset of geocoded polling stations can be found in the compressed csv file linked to on the release page. Version notes can be found here.
The dataset (`geocoded_polling_stations.csv.gz`) contains the following variables:

- `local_id`: Unique identifier for the polling station in a given election. This will vary across time, even for polling stations that are active in multiple elections.
- `ano`: Election year.
- `sg_uf`: State abbreviation.
- `cd_localidade_tse`: Municipal identifier used by the TSE.
- `cd_localidade_ibge`: Municipal identifier used by the IBGE.
- `nr_zona`: Electoral zone number.
- `nr_locvot`: Polling station number.
- `nr_cep`: Brazilian postal code.
- `nm_localidade`: Municipality.
- `nm_locvot`: Name of polling station.
- `ds_endereco`: Street address.
- `ds_bairro`: Neighborhood.
- `pred_long`: Longitude as selected by our model.
- `pred_lat`: Latitude as selected by our model.
- `pred_dist`: Predicted distance between chosen longitude and latitude and true longitude and latitude. For polling stations with coordinates provided by the TSE, this is set to 0. This variable can be used to filter coordinates based on their likely accuracy.
- `tse_lat`: Latitude provided by the TSE. This is only available for a subset of data.
- `tse_long`: Longitude provided by the TSE. This is only available for a subset of data.
- `long`: Longitude as predicted by the model or provided by the TSE.
- `lat`: Latitude as predicted by the model or provided by the TSE.
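As a minimal sketch of how `pred_dist` might be used to filter coordinates, the base R snippet below works on a small toy data frame that stands in for the real file. The values and the cutoff `max_dist` are hypothetical; choose a cutoff that suits your application, in the same units as `pred_dist`.

```r
# Toy stand-in for a few rows of geocoded_polling_stations.csv.gz
# (values are made up for illustration).
stations <- data.frame(
  local_id  = c(101, 102, 103, 104),
  pred_dist = c(0, 0.2, 1.5, 6.0),
  tse_lat   = c(-23.55, NA, NA, NA),
  lat       = c(-23.55, -22.91, -3.10, -9.97)
)

# Hypothetical cutoff, expressed in the same units as pred_dist.
max_dist <- 1

# Keep only stations whose predicted error is at or below the cutoff.
accurate <- stations[stations$pred_dist <= max_dist, ]

# Stations with TSE-provided coordinates have a non-missing tse_lat.
from_tse <- stations[!is.na(stations$tse_lat), ]
```

Stricter cutoffs trade coverage for accuracy, so it may be worth checking how many stations remain at each candidate value.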
We also created panel identifiers that track a given polling station over time. Because panel identifiers provided by the electoral authorities can change over time, we must use a fuzzy matching procedure to create our own panel identifiers. The process implemented to generate the panel identifiers consists of six stages. First, we subset the data at the state level for each electoral year. Then, we generate every possible pair of polling stations at the municipality level for every consecutive electoral year. This can be as few as three possible pairs for the least populous municipality in Brazil, Serra da Saudade-MG, which had one polling station in 2006 and three in 2008, or as many as millions of pairs for the most populous municipality, São Paulo-SP, which has over 1,500 polling stations in each electoral year. The next step is to calculate the Jaro-Winkler string similarity for each possible pair on two strings: the normalized name and the normalized address of the location.
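The pair-generation and string-comparison steps can be sketched in base R as follows. This is an illustration under stated assumptions, not the code used in the pipeline: the station names are invented, and `jaro_winkler()` is a hypothetical helper written here for self-containment (a package such as `stringdist` offers an equivalent, faster implementation).

```r
# Hypothetical helper: Jaro-Winkler similarity between two strings,
# with prefix scaling factor p (0.1 is the conventional value).
jaro_winkler <- function(a, b, p = 0.1) {
  if (a == b) return(1)
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  if (n == 0 || m == 0) return(0)
  # Characters match only if they are within this many positions.
  window <- max(0, floor(max(n, m) / 2) - 1)
  x_matched <- rep(FALSE, n)
  y_matched <- rep(FALSE, m)
  for (i in seq_len(n)) {
    lo <- max(1, i - window)
    hi <- min(m, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!y_matched[j] && x[i] == y[j]) {
        x_matched[i] <- TRUE
        y_matched[j] <- TRUE
        break
      }
    }
  }
  n_match <- sum(x_matched)
  if (n_match == 0) return(0)
  # Transpositions: matched characters that appear in a different order.
  transpositions <- sum(x[x_matched] != y[y_matched]) / 2
  jaro <- (n_match / n + n_match / m +
             (n_match - transpositions) / n_match) / 3
  # Winkler adjustment: reward a common prefix of up to four characters.
  prefix <- 0
  for (i in seq_len(min(4, n, m))) {
    if (x[i] == y[i]) prefix <- prefix + 1 else break
  }
  jaro + prefix * p * (1 - jaro)
}

# Invented stations for one municipality in two consecutive election years.
prev <- data.frame(
  id_prev   = "p1",
  name_prev = "ESCOLA MUNICIPAL JOSE DE ALENCAR",
  stringsAsFactors = FALSE
)
nxt <- data.frame(
  id_next   = c("n1", "n2", "n3"),
  name_next = c("ESCOLA MUN JOSE DE ALENCAR", "CAMARA MUNICIPAL",
                "POSTO DE SAUDE CENTRAL"),
  stringsAsFactors = FALSE
)

# Every possible pair: merge() with by = NULL returns the Cartesian product.
pairs <- merge(prev, nxt, by = NULL)

# Similarity of the normalized names for each candidate pair.
pairs$name_sim <- mapply(jaro_winkler, pairs$name_prev, pairs$name_next,
                         USE.NAMES = FALSE)
```

With one station in the earlier year and three in the later year, `pairs` has three rows, mirroring the Serra da Saudade-MG case. The same comparison would be repeated for the normalized address.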
Subsequently, we use the Fellegi-Sunter framework for record linkage to choose the best matches as implemented in the `reclin2` package. Specifically, we use an Expectation-Maximization (EM) algorithm to calculate the probabilities of a given pair being a match. We retain pairs with a probability greater than 0.5. To choose the final matches, we select the best matches under the constraint that each polling station can only be matched once. Finally, we construct the panel by combining the pairs matched in each consecutive year and establishing a unique panel identifier for those observations.
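The last three stages (thresholding, one-to-one selection, and chaining matches into a panel) can be illustrated with the base R sketch below. It is a simplification under stated assumptions: the match probabilities are invented rather than estimated by EM, and the one-to-one step uses a simple greedy rule as a stand-in for the constrained selection performed with `reclin2`.

```r
# Invented match probabilities for candidate pairs in consecutive years.
pairs <- data.frame(
  year_prev = c(2006, 2006, 2006, 2008, 2008),
  id_prev   = c("a06", "a06", "b06", "a08", "b08"),
  id_next   = c("a08", "b08", "b08", "a10", "c10"),
  prob      = c(0.95, 0.60, 0.90, 0.80, 0.30),
  stringsAsFactors = FALSE
)

# Stage 1: retain pairs with a match probability greater than 0.5.
candidates <- pairs[pairs$prob > 0.5, ]

# Stage 2 (greedy stand-in): accept pairs in decreasing order of
# probability, skipping any pair whose station is already matched.
candidates <- candidates[order(-candidates$prob), ]
keep <- logical(nrow(candidates))
used_prev <- character(0)
used_next <- character(0)
for (i in seq_len(nrow(candidates))) {
  free <- !(candidates$id_prev[i] %in% used_prev) &&
    !(candidates$id_next[i] %in% used_next)
  if (free) {
    keep[i] <- TRUE
    used_prev <- c(used_prev, candidates$id_prev[i])
    used_next <- c(used_next, candidates$id_next[i])
  }
}
matches <- candidates[keep, ]

# Stage 3: every station starts with its own panel identifier; walking
# through the matches in chronological order, each matched station
# inherits the identifier of its predecessor.
stations <- unique(c(pairs$id_prev, pairs$id_next))
panel_id <- setNames(seq_along(stations), stations)
matches <- matches[order(matches$year_prev), ]
for (i in seq_len(nrow(matches))) {
  panel_id[matches$id_next[i]] <- panel_id[matches$id_prev[i]]
}
```

In this toy example, `a06`, `a08`, and `a10` end up sharing one panel identifier, while `c10` keeps its own because its only candidate pair falls below the 0.5 threshold.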
The dataset `panel_ids.csv.gz` has the following variables:

- `ano`: year
- `panel_id`: unique panel identifier. Units with the same `panel_id` are classified to be the same polling station in two different election years according to our fuzzy matching procedure.
- `local_id`: polling station identifier. Use this variable to merge with the coordinates data.
- `long`: This is a longitude variable that is constant for all observations with the same `panel_id` across years. To choose among coordinates from different years, we select the one with the smallest predicted distance to the true location. Ties are broken by selecting the longitude from the latest year.
- `lat`: This is a latitude variable that is constant for all observations with the same `panel_id` across years. To choose among coordinates from different years, we select the one with the smallest predicted distance to the true location. Ties are broken by selecting the latitude from the latest year.
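The rule for choosing one coordinate per panel can be sketched in base R as follows. The data frame is a toy example with invented values, and the code is an illustration of the selection rule described above rather than the pipeline's implementation.

```r
# Toy coordinates for two panels observed in several election years.
coords <- data.frame(
  panel_id  = c(1, 1, 1, 2, 2),
  ano       = c(2018, 2020, 2022, 2020, 2022),
  pred_long = c(-46.60, -46.61, -46.62, -43.20, -43.21),
  pred_lat  = c(-23.50, -23.51, -23.52, -22.90, -22.91),
  pred_dist = c(0.4, 0.1, 0.1, 0, 0.3)
)

# Sort so that the preferred row comes first within each panel:
# smallest predicted distance, then the latest year to break ties.
ordered <- coords[order(coords$panel_id, coords$pred_dist, -coords$ano), ]

# Keep the first row per panel and rename to the panel-level variables.
best <- ordered[!duplicated(ordered$panel_id),
                c("panel_id", "pred_long", "pred_lat")]
names(best) <- c("panel_id", "long", "lat")

# Attach the chosen coordinates to every year of the panel.
panel <- merge(coords[, c("panel_id", "ano")], best, by = "panel_id")
```

Panel 1 has a tie on `pred_dist` between 2020 and 2022, so the 2022 coordinates are used; panel 2 takes the 2020 coordinates, which have the smaller predicted distance.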
We used the open source language R (version 4.4.0) to process the files and geocode the polling stations. To manage the pipeline that imports and processes all the data, we use the `targets` package.
Assuming all the relevant data is in the `./data` folder, you can reconstruct the dataset using the following code:
#Set working directory to project directory
setwd(".")
renv::restore() #to install necessary packages
targets::tar_make() # to run pipeline
Options to modify how the pipeline runs (e.g. parallel processing options) can be found in the `_targets.R` file. The pipeline is defined in the `_targets.R` file as well. We use the `renv` package to manage package dependencies. To ensure that you are using the right package versions, invoke `renv::restore()` when the working directory is set to the GitHub repo directory.
Given the size of some of the data files, you will likely need at least 50GB of RAM to run the code.
While one can get disaggregated electoral data directly from the TSE, I recommend obtaining polling station-level data from CEPESP DATA, as it has been cleaned, aggregated, and standardized.
For merging with electoral data provided by the TSE, you will typically have to work with data reported at the "seção" level, which is below the polling station level. Generally, one will need to aggregate the "seção"-level data to the polling station level, using municipality code, electoral zone code, and polling station code. Once aggregated, you can then merge with the coordinates data provided here.
As an example, I provide code for merging the 2018 electorate data, which is reported at the "seção" level, with the coordinates data.
library(data.table) #for importing and aggregating data
polling_coord <- fread("geocoded_polling_stations.csv.gz")
#Subset on 2018 polling stations
coord_2018 <- polling_coord[ano == 2018, ]
#import 2018 electorate data from TSE
electorate_2018 <- fread("eleitorado_local_votacao_2018.csv", encoding = "Latin-1")
#aggregate data to the polling station level
electorate_local18 <- electorate_2018[, .(electorate = sum(QT_ELEITOR)),
  by = c("CD_MUNICIPIO", "NR_ZONA", "NR_LOCAL_VOTACAO")
]
#merge by municipality, zone, and polling station identifier
coord_electorate18 <- merge(coord_2018, electorate_local18,
  by.x = c("cd_localidade_tse", "nr_zona", "nr_locvot"),
  by.y = c("CD_MUNICIPIO", "NR_ZONA", "NR_LOCAL_VOTACAO")
)
Because of the size of some of the administrative datasets, we cannot host all the data necessary to run the code on GitHub.
Datasets marked with a * can be found at the associated link in the table below but not in this GitHub repo.
All other data can be found in the `data` folder.
Data | Source |
---|---|
2010 CNEFE* | IBGE FTP Server |
2017 CNEFE* | IBGE Website |
2022 CNEFE* | IBGE Website |
INEP School Catalog | INEP Website |
Polling Stations Geocoded by TSE* | TSE |
Polling Station Addresses | Centro de Política e Economia do Setor Público |
Census Tract Shape Files* | geobr Package |
Municipal Demographic Variables | Atlas do Desenvolvimento Humano no Brasil |
Thanks to:

- Lucas Nobrega for help improving the panel identifier code.
- Yuri Kasahara for ideas and assistance in debugging.
- George Avelino, Mauricio Izumi, Gabriel Caseiro, and Daniel Travassos Ferreira at FGV/CEPESP for data and advice.
- Marco Antonio Faganello for excellent assistance at the early stages of the project.
- Spatial Maps at http://spatial2.cepesp.io