
An attempt to use csvdedupe, from the dedupe project, to normalize the Contributor Name field in DC campaign finance data.

The parent project is DC Campaign Finance Watch, part of CodeForDC.


This project uses Python 3.5.x with numpy. Node is optional, but it is handy for checking the training file.

pip3 install --upgrade numpy
pip3 install --upgrade csvkit
pip3 install --upgrade csvdedupe


Run csvdedupe

csvdedupe DC_contribs_since_2007.csv --field_names "Contributor Name" --output_file output.csv

View Deduped Data

  • Open output.csv in Excel.
  • Sort by Cluster ID (the first column).
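As an alternative to sorting in a spreadsheet, the following is a minimal Python sketch that groups the rows of output.csv by cluster and lists only the clusters holding more than one spelling of a name. It assumes the output has a "Cluster ID" column alongside the original "Contributor Name" column and that the file is UTF-8 encoded. The function name clusters_with_variants is hypothetical and not part of csvdedupe.

```python
import csv
import os
from collections import defaultdict


def clusters_with_variants(path, id_column="Cluster ID",
                           name_column="Contributor Name"):
    """Return {cluster id: sorted distinct names} for clusters with 2+ names.

    Column names and UTF-8 encoding are assumptions about output.csv.
    """
    names_by_cluster = defaultdict(set)
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            names_by_cluster[row[id_column]].add(row[name_column])
    return {
        cluster_id: sorted(names)
        for cluster_id, names in names_by_cluster.items()
        if len(names) > 1
    }


if __name__ == "__main__" and os.path.exists("output.csv"):
    variants = clusters_with_variants("output.csv")
    for cluster_id in sorted(variants):
        print(cluster_id, " | ".join(variants[cluster_id]))
```

Each printed line shows one cluster and the name variants csvdedupe grouped together, which makes it easier to spot-check whether the clustering looks reasonable.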

Check number of training examples completed

node -p "x = require('./training.json'); x['distinct'].length"
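If Node is not installed, the same count can be obtained from Python. This is a small sketch that mirrors the Node one-liner above. It assumes training.json is a JSON object with a "distinct" list, as the command above implies, and it also reports a "match" list if one is present (that key is an assumption about the file layout). The function name count_training_examples is hypothetical.

```python
import json
import os


def count_training_examples(path="training.json"):
    """Count labeled pairs per label in a dedupe training file.

    Assumes the file is a JSON object whose values are lists; a missing
    label is reported as 0.
    """
    with open(path, encoding="utf-8") as handle:
        training = json.load(handle)
    return {label: len(training.get(label, []))
            for label in ("distinct", "match")}


if __name__ == "__main__" and os.path.exists("training.json"):
    for label, count in sorted(count_training_examples().items()):
        print("{}: {}".format(label, count))
```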

Extract high level descriptive stats from CSV

csvstat DC_contribs_since_2007.csv
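csvstat summarizes every column. For a quick look at just the field being deduplicated, a rough Python sketch along these lines counts how often each raw Contributor Name appears before any clustering. It assumes the input file is UTF-8 encoded and has a "Contributor Name" header. The function name name_frequencies is hypothetical.

```python
import csv
import os
from collections import Counter


def name_frequencies(path, column="Contributor Name"):
    """Count occurrences of each exact value in one CSV column.

    The column name and UTF-8 encoding are assumptions about the input.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return Counter(row[column] for row in csv.DictReader(handle))


if __name__ == "__main__" and os.path.exists("DC_contribs_since_2007.csv"):
    counts = name_frequencies("DC_contribs_since_2007.csv")
    print("distinct names:", len(counts))
    for name, count in counts.most_common(10):
        print(count, name)
```

Comparing the number of distinct raw names here with the number of clusters in output.csv gives a rough sense of how much csvdedupe merged.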

Beginner Setup

brew install python
brew install homebrew/python/numpy
brew install node