This is a directory set up to make it easy to run PyEI
and store the outputs in a helpful way.
First, you should clone this repository and start working on your own branch, so that your outputs do not collide with anyone else's.
git clone https://github.com/gabeschoenbach/ei.git
cd ei
git checkout -b {YOUR_BRANCH_NAME}
TODO: determine the minimal installations necessary to make this project runnable.
The first — and most challenging — part of the EI pipeline is preparing an accurate dataset that can be fed into the PyEI
machinery. This data can be in a tabular (CSV) format or stored in a shapefile. You should make a shapes/
folder at the root of this directory and put your shapefile or CSV inside of it. For example, if you want to run EI on a Massachusetts 2020 VTD shapefile stored in a folder called MA_vtd20
, you should put that folder in shapes so that ei/shapes/MA_vtd20/
is a valid path.
The minimal information needed on your input data is the following:
- Demographic columns: a total electorate column like
VAP
orCVAP
and then a breakdown into racial groups, i.e.WVAP
,BVAP
,HVAP
,ASIANVAP
,NHPIVAP
,AMINVAP
,2MOREVAP
,OTHERVAP
. These columns should have counts for how many voters in each group reside in each geographical unit (row). - Election columns: A column for the vote totals for a given candidate. For example,
BMAY21PAG
refers to the votes that Annissa Essaibi George won in Boston's 2021 Mayoral primary race.
Optional columns could include:
- A county column, helpful if you want to run EI on individual counties.
Once you have prepared your input data, you should add its information to the info.py
file. Every script in this directory relies on info.py
to understand the data, so it's important to carefully fill it out when you add new data.
"MA_vtd20": {
"shapefile_path": "shapes/MA_vtd20",
"COUNTY_COL": "COUNTYFP20",
"counties": [None],
"elections": {
"BMAY21P": {
"POP_COL": "VAP",
"candidates": {
"BMAY21PAG": {
"name": "Annissa George",
"race": "W",
"won": True,
},
"BMAY21PAC": {
"name": "Andrea Campbell",
"race": "B",
"won": False,
},
"BMAY21PKJ": {
"name": "Kim Janey",
"race": "B",
"won": False,
},
"BMAY21PMW": {
"name": "Michelle Wu",
"race": "A",
"won": True,
},
},
},
},
},
The above blob contains information about the MA_vtd20
file that contains just one election — the 2021 Boston Mayoral race — labeled BMAY21P
. Inside, you specify the population basis you want to use for the election — since the MA_vtd20
shapefile has VAP
, WVAP
, BVAP
, ... columns, we specify VAP
. For each candidate, we specify a well-formatted name
that will be used in the plots, along with the candidate's race
and whether they actually won the election (in this example, the top two candidates advance, which is why Wu and George both have won=True
). Neither race
nor won
is currently used, but it's helpful to keep track of. COUNTY_COL
and counties
can be left as the empty string and [None]
, respectively, if you aren't planning on running EI on individual counties.
The main EI script is ei.py
. It accepts a -state
argument that refers to the key in info.py
that points to the data you're interested in (in our example, this would be MA_vtd20
). The -elec
argument determins which election EI will be run on, and you can specify as many -g
arguments as groups you want to divide the electorate into. Optionally you can set a -county
argument to run it on just one county. Lastly, you can specify a -num_tunes
and -num_draws
argument to set how long to run the MCMC process that happens under the hood of EI. The default is set to 10 for testing, but you should try 1000 for each to get good convergence.
For example, if we wanted to run EI on the 2021 Boston Mayoral primary on our Massachusetts shapefile, breaking the electorate into Black and non-Black voters, we would run:
python run_ei.py -state MA_vtd20 -elec BMAY21P -g BVAP -num_tunes 1000 -num_draws 1000
And if we wanted to run it while dividing the electorate into White, Black, Hispanic, and all other voters, we would run:
python run_ei.py -state MA_vtd20 -elec BMAY21P -g WVAP -g BVAP -g HVAP -num_tunes 1000 -num_draws 1000
This process will create (if not already there) an outputs/MA_vtd20/
folder in which to store results. Standard output and error will be printed to the terminal, but an outputs/MA_vtd20/ei/
subfolder will be created that stores the EI outputs as a .pickle
file, along with a .csv
file of the final input data in outputs/MA_vtd20/final_inputs/
. Each of these files will be named according to the EI run that it corresponds, which will be of the form {state}_{election}_{groups}
. As an example, the EI result from last Python call above would be saved as outputs/MA_vtd20/ei/MA_vtd20_BMAY21P_WVAP-BVAP-HVAP.pickle
.
You can summarize and visualize the outputs of a given EI run by calling the summary.py
and viz.py
scripts, respectively. On the above EI run, calling
python summary.py -state MA_vtd20 -elec BMAY21P -g WVAP -g BVAP -g HVAP
python viz.py -state MA_vtd20 -elec BMAY21P -g WVAP -g BVAP -g HVAP
would create outputs/MA_vtd20/summaries/
and outputs/MA_vtd20/plots/
respectively. The subfolders in those directories that include the phrase turnout_adjusted
refer to summaries and plots that compute candidate preferences only out of voters inferred to have participated in the election — that is, voters that had not chosen to abstain or vote for an unlisted candidate.
The submit_jobs.py
script creates and runs (via sbatch
) a job.sh
file that calls ei.py
, viz.py
, and summary.py
for every election in the elections
section of your state in info.py
. Hardcoded into the script (but easily editable) is the choice of demographic groups you want in the inference. As of 10/25/21, its set to WVAP, BVAP, HVAP
(and will implicitly group all the rest of the population into OVAP
, "Other"). To run this on the cluster, simply run
python submit_jobs.py -state MA_vtd20 -num_tunes 1000 -num_draws 1000
If you only want to run it on one election, you can specify so with an optional -elec
argument.