This repository contains a set of standalone tools that can be used alongside an Arlo risk-limiting audit (RLA).

In a risk-limiting audit, the number of ballots that must be sampled to achieve the risk limit is a function of the reported margin of victory. In the hypothetical where an attacker can compromise the tallying process, the attacker can announce a total with a huge margin, and the RLA will then require very few samples. If the real margin was small, the number of samples should have been much larger.
The fix to this is to require the election official to:
- commit to the set of all ballots, such that they cannot tamper with the set of electronic ballot records afterward (i.e., any request from the RLA, such as "give me ballot N", can be proven consistent with the commitment)
- prove that the election tally and margins of victory are consistent with this commitment (this is where all the e2e machinery comes into play)
With these, the RLA now has proof that it's working from the right starting point.
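To make the commitment idea concrete, here is a toy sketch (not the arlo-e2e implementation; the function names and the flat hash-list structure are illustrative): the official publishes a single commitment over all ballot records, and any "give me ballot N" response can then be checked against it.

```python
# Toy sketch of committing to a ballot set, then checking that a
# requested ballot is consistent with the published commitment.
import hashlib
import json

def commit_to_ballots(ballots):
    """Hash each ballot record, then hash the ordered list of hashes."""
    ballot_hashes = [
        hashlib.sha256(json.dumps(b, sort_keys=True).encode()).hexdigest()
        for b in ballots
    ]
    commitment = hashlib.sha256("".join(ballot_hashes).encode()).hexdigest()
    return commitment, ballot_hashes

def check_ballot(n, ballot, ballot_hashes, commitment):
    """Verify ballot N against the hash list and the overall commitment."""
    h = hashlib.sha256(json.dumps(ballot, sort_keys=True).encode()).hexdigest()
    recomputed = hashlib.sha256("".join(ballot_hashes).encode()).hexdigest()
    return h == ballot_hashes[n] and recomputed == commitment
```

Any after-the-fact edit to a ballot changes its hash, which no longer matches the hash list, which in turn is pinned down by the commitment.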
This idea, in various forms, has been around for several years. A recent paper from Benaloh, Stark, and Teague (slides) describes exactly the design we want to build.
A nice property of this design is that it's completely "on the side" of the regular RLA process. You can do an RLA without the e2e part, and you still get something useful. The e2e just makes things more robust. You can layer it onto an existing RLA process without requiring changes to how votes are cast and tabulated.
And, of course, if you do happen to have voting machines that generate e2e ciphertexts, now those fit in very nicely here, so this scheme provides something of a stepping stone toward e2e technologies that allow voters to verify their votes were correctly tallied.
`arlo_initialize_keys`
: Creates a public/private key pair for the election administrator. For future versions of arlo-e2e, threshold cryptography would be interesting because it reduces the downside risk of the election administrator's key material being stolen. Of course, that same election administrator already has the plaintext ballots.
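Conceptually, the key pair is a standard ElGamal key pair over a large prime-order group. The toy sketch below (deliberately tiny parameters, nothing like the ElectionGuard constants, and not arlo-e2e's actual key-file format) shows the basic shape:

```python
# Toy ElGamal key generation, illustrating what arlo_initialize_keys
# produces conceptually. The group here is absurdly small and for
# illustration only; the real system uses ElectionGuard's large primes.
import secrets

P = 23  # toy prime modulus (illustrative only)
G = 5   # generator of the multiplicative group mod P

def make_keypair():
    secret_key = secrets.randbelow(P - 2) + 1  # s in [1, P-2]
    public_key = pow(G, secret_key, P)         # g^s mod p
    return secret_key, public_key
```

The public key goes into the tally process; the secret key is only needed later, when individual ballots are decrypted for the audit.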
`arlo_tally_ballots`
: Input is a file full of CVRs, in Dominion format, and the key file from `arlo_initialize_keys`. Output is a directory full of JSON files, including the encrypted ballots, their associated proofs, the tallies, and other useful metadata. This directory could be posted on a web site, making it a realization of the public bulletin board concept that appears widely in the cryptographic voting literature. The election official might then distribute the SHA256 hash of `MANIFEST.json`, which contains SHA256 hashes of every other JSON file in the directory, allowing for incremental integrity checking of files as they're read.
`arlo_verify_tally`
: Input is a tally directory (the output of `arlo_tally_ballots`). The election private key is not needed. This tool verifies that the tally is consistent with all the encrypted ballots, and that all the proofs verify correctly. This process is something that a third-party observer would conduct against the public bulletin board.
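The reason the tally can be checked against the encrypted ballots is that exponential ElGamal is additively homomorphic: multiplying ciphertexts adds the underlying votes. The toy below demonstrates just that property (tiny illustrative parameters; the real system additionally verifies zero-knowledge proofs such as Chaum-Pedersen proofs at each step, which this sketch omits):

```python
# Toy exponential ElGamal: the componentwise product of ciphertexts
# is an encryption of the sum of the plaintext votes.
import secrets

P, G = 23, 5  # toy prime and generator (illustrative only)

def encrypt(vote, public_key):
    r = secrets.randbelow(P - 2) + 1
    return pow(G, r, P), (pow(G, vote, P) * pow(public_key, r, P)) % P

def aggregate(ciphertexts):
    """Componentwise product = encryption of the sum of the votes."""
    a, b = 1, 1
    for (x, y) in ciphertexts:
        a, b = (a * x) % P, (b * y) % P
    return a, b

def decrypt_tally(ciphertext, secret_key):
    a, b = ciphertext
    g_m = (b * pow(a, -secret_key, P)) % P  # b / a^s = g^(sum of votes)
    # brute-force the discrete log; fine for small tallies
    for m in range(P):
        if pow(G, m, P) == g_m:
            return m
    raise ValueError("tally out of range")
```

An observer can recompute the aggregate ciphertext from the published ballots and check that it matches what the official claims to have decrypted, all without the private key.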
`arlo_ballot_style_summary`
: Input is a tally directory, output is a summary of all the contests and ballot styles. Demonstrates how to work with the metadata included in a tally directory.
`arlo_all_ballots_for_contest`
: Input is a tally directory, output is a list of every ballot id for ballots having the desired contest. Demonstrates how to work with the metadata included in a tally directory.
`arlo_decrypt_ballots`
: Input is one or more ballot identifiers (the same as the ballot file names, but without the `.json` suffix) and the private key of the election. Output ballots are written to a separate directory, including both the plaintext and proofs of the plaintext's correctness. These are the ballots that a "ballot-level comparison audit" would be considering.
`arlo_decode_ballots`
: Given some ballot identifiers (as above), prints everything we know about those ballots. If they were previously decrypted, this will print their decryptions and verify their equivalence proofs. If the proofs don't check out, this tool flags the issue. (Like `arlo_verify_tally`, this tool would be used by a third-party observer of an election audit.) For ballots that were never decrypted, we at least print the available metadata for the ballot.
The current `arlo-e2e` codebase has detailed unit tests built with Hypothesis. It can generate a random set of CVRs for a random election (varying the number of contests, candidates per contest, and so forth), encrypt them and generate all the proofs, serialize them to disk, read them all back in again, and verify all the proofs.
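The real tests use Hypothesis to generate the randomness; the sketch below shows the shape of the same round-trip property using plain `random` from the standard library (the CVR structure here is made up for illustration, not the Dominion format):

```python
# Round-trip property: generate random CVRs, serialize to disk,
# read them back, and check that nothing changed.
import json
import random
import tempfile
from pathlib import Path

def random_cvrs(rng, n_ballots=20, n_contests=3):
    """Build a random election: contests with random candidate counts."""
    contests = {
        f"contest-{c}": [f"candidate-{c}-{i}" for i in range(rng.randint(2, 5))]
        for c in range(n_contests)
    }
    return [
        {name: rng.choice(cands) for name, cands in contests.items()}
        for _ in range(n_ballots)
    ]

rng = random.Random(42)  # seeded for reproducibility
cvrs = random_cvrs(rng)
out = Path(tempfile.mkdtemp())
for i, cvr in enumerate(cvrs):
    (out / f"ballot_{i}.json").write_text(json.dumps(cvr, sort_keys=True))
round_tripped = [
    json.loads((out / f"ballot_{i}.json").read_text()) for i in range(len(cvrs))
]
assert round_tripped == cvrs
```

In the real test suite the same property additionally covers encryption, proof generation, and proof verification on either side of the serialization step.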
On top of this, we have all the command-line tools listed above, plus code that can use all the cores of a multicore computer to accelerate the process on a single machine. Also in progress is code to support cluster parallelism for the encryption/tallying process, so we can scale to large elections; we have already prototyped tallying code using Ray.
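The parallelism is structurally a map over independent ballots. The sketch below shows that structure with a placeholder `encrypt_ballot` (a stand-in hash, not the real ElGamal work); note that because real ballot encryption is CPU-bound Python, the actual code uses process-level parallelism on one machine and Ray across a cluster, rather than the thread pool used here to keep the sketch portable:

```python
# Structural sketch: fan ballot encryption out across workers.
import hashlib
from concurrent.futures import ThreadPoolExecutor

def encrypt_ballot(ballot: str) -> str:
    # placeholder for the expensive ElGamal encryption + proof generation
    return hashlib.sha256(ballot.encode()).hexdigest()

def encrypt_all(ballots, max_workers=8):
    # pool.map preserves input order, so results line up with ballots
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(encrypt_ballot, ballots))
```

Because each ballot is encrypted independently, the work distributes with no coordination beyond collecting results in order.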
Not in version 1 but on the future wishlist:
- Some sort of binary file format or use of a real database to store all the encrypted CVRs. Gzip on our encrypted ballots reduces them to 2/3 of their original human-readable size.
- Some sort of integration with Arlo, rather than running as a standalone tool.
- Some sort of web frontend, rather than the command-line tools.
This code builds on the ElectionGuard Python Implementation, which itself was built to be a complete implementation of the ElectionGuard spec, including many features that we're not using here, such as threshold cryptography.
Other libraries that we're not using, but could conceivably use at some point:
- Verificatum
  - Code is all MIT licensed
  - Core libraries in C
  - Alternative implementations in Java (server-side) and JavaScript (for embedding in a hypothetical voting machine)
This part of the implementation is currently in need of some generalization and cleanup, but here's how it works right now.
There are two YAML configurations for Ray in the `cloud` directory. One of them, `aws-config.yaml`, works and is tested regularly. The other, `azure-config.yaml`, is more of a work in progress and should not be assumed to be ready to go.
A third file, `iam-policy-summary.txt`, is something of an outline of how we had to specify the AWS security policies (IAM) in order to ensure that our EC2 virtual machines can speak to our S3 buckets. These are almost certainly far from optimal. When making a new S3 bucket, expect to fiddle with the IAM policies until you can get everything mounted correctly with `s3fs`.
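The policies boil down to the usual S3 permissions that `s3fs` needs: listing the bucket, plus reading and writing objects in it. A hedged sketch (the bucket name `arlo-data` and the exact action list are illustrative; see `iam-policy-summary.txt` for what we actually used):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::arlo-data"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::arlo-data/*"
    }
  ]
}
```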
Within `aws-config.yaml`:

- The currently specified `max_workers` and `initial_workers` are the biggest we could run without triggering a weird crashy behavior in Ray. They're working on it. Other Ray loads use far more nodes than arlo-e2e, so it's something about the way we stress the cluster that's different.
- The "worker" nodes we're currently using are `c5a.16xlarge` (beefy 64 vCPU AMD machines), with `c5.16xlarge` (similarly beefy Intel machines) as an alternate, and an `m5a.xlarge` (four vCPUs but lots more memory) that we use for our "head" node.
- The Ray autoscaler does all the work of creating and destroying our nodes on EC2. It starts from an Ubuntu 20.04 VM, pre-installed with dependencies that we need (you'll see the various `ami-` strings in there). Once it launches each VM, it then executes the various `setup_commands`. Mostly these just install more things, but they also set up `s3fs`, which mounts the S3 storage bucket we want onto `/mnt/arlo-data`.
- If you want to try to make your own Ubuntu virtual machine, you may find the script in `ubuntu-setup.sh` to be helpful. This probably could/should be done just as well with Docker, but we get where we need to be without it, for now.
- Right now, everything is kinda arbitrarily set up to work in Amazon's `us-east-2` datacenter. There's no reason you couldn't run this elsewhere. Just make sure the S3 storage and EC2 compute nodes are in the same data center to avoid crazy data-transfer charges.
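The pieces above fit together roughly like this. This is a trimmed, illustrative sketch in the shape of a Ray autoscaler config, not the actual file: the instance types and region match the text, but the cluster name, worker counts, mount command, and elided AMI IDs are placeholders.

```yaml
cluster_name: arlo-e2e            # illustrative name
max_workers: 4                    # largest values we could run reliably
initial_workers: 4
provider:
  type: aws
  region: us-east-2               # keep S3 and EC2 in the same region
head_node:
  InstanceType: m5a.xlarge        # 4 vCPUs, extra memory for the head
  ImageId: ami-...                # pre-built Ubuntu 20.04 image
worker_nodes:
  InstanceType: c5a.16xlarge      # 64 vCPU AMD workers
  ImageId: ami-...
setup_commands:
  # install remaining dependencies, then mount the S3 bucket via s3fs
  - s3fs arlo-data /mnt/arlo-data -o iam_role=auto
```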