/foia-online

Analysis related to article on FOIA Online Database.

CJR - What makes a good FOIA request? We studied 33,000 to find out.

  • Code by Nic & Rashida

1/31/2017: We're hoping to have all of our code up tonight. Important note: the conclusions of our article are not based on the current "analysis" notebook. That notebook contains simpler "gut-checks" that poke through the data, guided by some of the terms that came up in our text analysis and decision tree analysis. The code for those two analyses will be posted shortly!


We scraped all requests for years 2011 through 2016 available at https://foiaonline.regulations.gov/ and filtered out requests that were still under agency review.

Two categories of features were considered: ones we derived ourselves, and others indicating the presence of a specific word in a request. We reduced words to their base forms using a lemmatizer before generating a document-term matrix. Below is a complete list of our derived features:

  • Character count
  • Word count
  • Sentence count
  • Average sentence length
  • Whether the request mentioned the Freedom of Information Act
  • Whether the request mentioned fulfillment fees
  • Whether the request included a phone number
  • Whether the request included a hyperlink
  • Whether the request included an email address
  • Whether the request included a date
  • Whether the request referred to a place
  • A readability score derived from character, word and sentence counts
  • Whether the request mentioned strings associated with data, like data formats
  • A specificity score, which amounts to the number of groups of nouns
  • Whether the request included an EPA identification number

We used decision trees as a first step in our search for important features.
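To give a sense of how a decision tree surfaces important features, the sketch below scores binary features by how much a single split on each one reduces Gini impurity, which is the criterion a CART-style tree commonly applies at each node. The choice of Gini impurity, the function names, and the toy data are assumptions for illustration; they are not taken from our scripts.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def impurity_decrease(feature, labels):
    """Impurity reduction from splitting the requests on one binary feature."""
    absent = [y for x, y in zip(feature, labels) if not x]
    present = [y for x, y in zip(feature, labels) if x]
    n = len(labels)
    weighted = (len(absent) / n) * gini(absent) + (len(present) / n) * gini(present)
    return gini(labels) - weighted


def rank_features(columns, labels):
    """Sort feature names by impurity decrease, best first."""
    scores = {name: impurity_decrease(col, labels) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (made up): 1 = feature present / request granted, 0 = otherwise.
columns = {
    "has_date": [1, 1, 0, 0],
    "has_link": [1, 0, 1, 0],
}
granted = [1, 1, 0, 0]
ranking = rank_features(columns, granted)
```

In this toy data `has_date` separates granted from denied requests perfectly, so it ranks first, while `has_link` carries no information. A full tree repeats this scoring recursively on each resulting subset.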

We classified requests in two ways. One limited our definition of a successful request to full grants. The other also counted partial grants as successes. We did not observe a significant difference in our decision trees when switching between the two. This is likely because the number of partial grants was low. Given this lack of difference, we used our stricter definition of success.
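The two definitions of success can be expressed as a single labeling function with a switch. The disposition strings and the name `is_success` below are hypothetical placeholders; the labels in the scraped data may be worded differently.

```python
# Hypothetical disposition labels; the scraped data may use other wording.
FULL_GRANT = "full grant"
PARTIAL_GRANT = "partial grant"


def is_success(disposition, count_partial=False):
    """Label one request as a success under the strict or the looser definition."""
    outcome = disposition.strip().lower()
    if outcome == FULL_GRANT:
        return True
    # Partial grants count only under the looser definition.
    return count_partial and outcome == PARTIAL_GRANT


# Toy dispositions (made up) to compare the two definitions.
dispositions = ["Full grant", "Partial grant", "Denied", "Full grant"]
strict_rate = sum(is_success(d) for d in dispositions) / len(dispositions)
loose_rate = sum(is_success(d, count_partial=True) for d in dispositions) / len(dispositions)
```

When partial grants are rare, the two rates stay close, which is consistent with the trees changing little between definitions.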

When trying to draw conclusions about FOIA requests in general (that is, across all agencies), we had to account for the fact that the FOIAOnline database had more requests from some agencies than from others. We did this by repeatedly (1,000 times) taking equally sized samples from each agency. A more detailed description of how we did this can be found in the comments of our R scripts.
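The resampling idea can be sketched in Python as below. Our actual implementation is in the R scripts; this version makes several assumptions of its own: sampling is done without replacement, the per-agency sample size defaults to the size of the smallest agency, and the field names `agency` and `success` are hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(requests, rng, per_agency=None):
    """Draw the same number of requests from every agency (without replacement)."""
    by_agency = defaultdict(list)
    for request in requests:
        by_agency[request["agency"]].append(request)
    if per_agency is None:
        # Assumption: use the smallest agency's size so every agency can supply it.
        per_agency = min(len(rows) for rows in by_agency.values())
    sample = []
    for rows in by_agency.values():
        sample.extend(rng.sample(rows, per_agency))
    return sample


def balanced_success_rates(requests, n_rounds=1000, seed=0):
    """Success rate of each of n_rounds balanced samples."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    rates = []
    for _ in range(n_rounds):
        sample = balanced_sample(requests, rng)
        rates.append(sum(r["success"] for r in sample) / len(sample))
    return rates


# Toy data (made up): agency A dominates the raw counts.
requests = (
    [{"agency": "A", "success": 1} for _ in range(90)]
    + [{"agency": "B", "success": 0} for _ in range(10)]
)
raw_rate = sum(r["success"] for r in requests) / len(requests)
rates = balanced_success_rates(requests)
balanced_rate = sum(rates) / len(rates)
```

In the toy data the raw success rate is dominated by agency A, while each balanced sample weights both agencies equally, so the averaged rate reflects agencies rather than request volume.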