openoakland/opendisclosure

Create a graph showing who the top 5-10 contributors to each campaign are (people or companies)

kleinlieu opened this issue · 6 comments

This is Question 1 of 5 from the Public Ethics Commission.

@sdoumbouya and I are working on this one. Here are the fields:

  • Unique Key for Person: Tran_NamF + Tran_NamL + Tran_City + Tran_State + Tran_Zip
  • Tran_Amt1: Individual Transaction
  • Tran_Emp: Employer (If it is blank, then it is usually an entity: business, union, etc…)
  • Tran_Occ: Occupation
  • Entity_Cd: tells you if the contribution is from an individual or organization
  • If Tran_NamL is empty, then assume that you have an organization. The organization name is Tran_NamF
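
Here's a rough sketch (in Python) of how these fields could fit together for the top-contributors question. It's only an illustration: the record dicts, the normalization, and the function names (contributor_key, top_contributors) are assumptions on my part, not code from the repo.

```python
from collections import defaultdict

def contributor_key(rec):
    """Unique key for a person: Tran_NamF + Tran_NamL + Tran_City + Tran_State + Tran_Zip."""
    parts = [rec.get("Tran_NamF", ""), rec.get("Tran_NamL", ""),
             rec.get("Tran_City", ""), rec.get("Tran_State", ""),
             rec.get("Tran_Zip", "")]
    # Normalize case/whitespace so trivially different entries collapse together.
    return "|".join(str(p).strip().upper() for p in parts)

def contributor_name(rec):
    """If Tran_NamL is empty, treat the record as an organization named by Tran_NamF.
    Entity_Cd could also be consulted here to distinguish individuals from organizations."""
    if not rec.get("Tran_NamL", "").strip():
        return rec.get("Tran_NamF", "").strip(), "organization"
    return (rec.get("Tran_NamF", "") + " " + rec["Tran_NamL"]).strip(), "individual"

def top_contributors(records, n=10):
    """Sum Tran_Amt1 per contributor key and return the top n (name, total) pairs."""
    totals = defaultdict(float)
    names = {}
    for rec in records:
        key = contributor_key(rec)
        totals[key] += float(rec.get("Tran_Amt1") or 0)
        names[key] = contributor_name(rec)[0]
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(names[k], total) for k, total in ranked[:n]]

# e.g. top_contributors(list_of_transaction_dicts, n=10) per campaign committee
```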

@kleinlieu you spoke of us having access to EC2 instances. I'd like to set up a server for @ted27 to dump his data to. Our piece will be easy to deal with once it is in a database of some sort.

Awesome. Let me get that information to you. In the meantime, could you all use a Heroku Postgres development database for now?

Hey @sdoumbouya, got the credentials! Let me know what kind of instance you need.

At last week's meetup someone mentioned that one issue here was the different ways people enter information. I took a look at the current data and looked into one way of resolving this.
Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance) tells how many changes are needed to turn one string into another. I found a MySQL implementation (http://www.artfulsoftware.com/infotree/qrytip.php?id=552) and tried some experiments on the company names I extracted from the current data set.
Looking at a distance of less than 1/4 of the average of the two string lengths finds a good number of items that are variants of each other. On the other hand, it also finds a good number of false matches, e.g. SRI International and RTI International, or Oakland Unified School District and Berkeley Unified School District. Going to a lower distance may miss things like Sheppard Mullin Richter & Hampton LLP and Sheppard, Mullin, Richter and Hampton, which have a distance of 9.
This seems like an approach that might be useful to help with data cleaning, but cannot be automatic. The current implementation in MySQL is rather slow as well.
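
For reference, here's the same heuristic sketched in Python rather than MySQL. The function names and the pairwise loop are mine, not from the experiment; the cutoff is the 1/4-of-average-length one described above, and the output is meant as a list of candidates for manual review, not automatic merges.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def candidate_matches(names):
    """Flag pairs whose distance is under 1/4 of their average length.

    These are candidates for hand review; false matches like
    'SRI International' / 'RTI International' will slip in.
    """
    pairs = []
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            threshold = (len(x) + len(y)) / 2 / 4
            if levenshtein(x.lower(), y.lower()) < threshold:
                pairs.append((x, y))
    return pairs

print(candidate_matches([
    "Sheppard Mullin Richter & Hampton LLP",
    "Sheppard, Mullin, Richter and Hampton",
    "SRI International",
    "RTI International",
]))
```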

I'll connect y'all with one of my research guys who did a lot of company name matching for our foreclosure ownership report. We did a lot of it manually, but learned a lot in the process too.