Repo for Career Track Capstone on OpenFEMA
The insights in this project come from three datasets. Two of them (home owner and home renter records) were merged because their information overlaps; the third is summary data for all of the disasters, of which there are 306. This merging is reflected in the notebook title "OpenFEMA-data-wrangling". All home owner/renter data was then merged with the summary data so that every record holds the disaster information as well. This produced a final working dataset of roughly 100,000 records and 40 variables, each record indexed by disaster number, then zip code, and then city.
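As a rough illustration of that merge, here is a minimal sketch on made-up rows. It assumes the owner and renter records were stacked row-wise and that the summary data joins on disasterNumber; the residentType column, the sample values, and the choice of pd.concat are assumptions for illustration, not necessarily what the notebook does.

```python
import pandas as pd

# Made-up rows standing in for the three OpenFEMA downloads.
owners = pd.DataFrame({
    "disasterNumber": [1, 1],
    "zipCode": ["11111", "22222"],
    "city": ["ALPHA", "BETA"],
    "validRegistrations": [10, 4],
})
renters = pd.DataFrame({
    "disasterNumber": [1],
    "zipCode": ["11111"],
    "city": ["ALPHA"],
    "validRegistrations": [7],
})
summaries = pd.DataFrame({
    "disasterNumber": [1],
    "incidentType": ["Fire"],
})

# Stack owner and renter records; residentType is a hypothetical tag column
# that remembers which dataset each row came from.
housing = pd.concat(
    [owners.assign(residentType="owner"), renters.assign(residentType="renter")],
    ignore_index=True,
)

# Attach the disaster-level columns to every housing record.
# validate= raises if a disaster number appears twice in the summary data.
df = housing.merge(summaries, on="disasterNumber", how="left", validate="many_to_one")

# Index by disaster number, then zip code, then city.
df = df.set_index(["disasterNumber", "zipCode", "city"]).sort_index()
```

A left join keeps every housing record even if its disaster is missing from the summary data, which makes such gaps visible as NaN rather than silently dropping rows.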
First dataset: Home Owner Data
Second dataset: Renter Data
Third dataset: Disaster Summaries
All three of these were accessible via OpenFEMA's API and the urllib.request package.
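A download step along these lines might look like the sketch below. The helper name download_dataset and the URL are hypothetical placeholders; substitute the actual OpenFEMA endpoint for each dataset.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical placeholder, NOT a real OpenFEMA endpoint.
EXAMPLE_URL = "https://example.invalid/HomeOwnerData.csv"


def download_dataset(url, dest_dir, filename):
    """Download url into dest_dir/filename unless the file is already there."""
    dest = Path(dest_dir) / filename
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        # urlretrieve streams the response body straight to disk.
        urlretrieve(url, str(dest))
    return dest
```

For example, `download_dataset(EXAMPLE_URL, "data", "owners.csv")` would save the file under data/ once a real URL is supplied. Skipping files that already exist avoids re-downloading large CSVs on every notebook run.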
More pressing than where the data came from is the condition of the data: OpenFEMA proved to be some of the most sloppily kept records I could have chosen to deal with for such a project. Most notably, numerous misspellings led to records littered with zeros, which ultimately dictated how I could visualize the data. In Butte County, California, one disaster recorded how the city of "Paradise" was reimbursed for its loss. However, because there were individual records for each of "Paradise", "Paradsie", "Pardarise", and "Pradise", the city's figures were split across several rows and the information was mostly distorted. This was more or less solved (read: worked around) by aggregating by disaster and, later, by separately aggregating by zip code, but both of these approaches meant missing out on the robustness of the record-level data. Nor did these workarounds account for the mixed datatypes in so many of the columns. In the zipCode column in particular, cleaning involved dealing with records whose zip code was the float 0, a blank cell, or the string "00000", none of which are actual zip codes we could use.
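The zip code cleanup described above can be sketched roughly as follows. The helper clean_zip and the sample values are hypothetical; the notebook may handle these cases differently.

```python
import pandas as pd


def clean_zip(value):
    """Return a 5-character zip code string, or None if the value is unusable."""
    if pd.isna(value):
        return None
    text = str(value).strip()
    # A float such as 95969.0 prints as "95969.0"; drop the fractional part.
    if text.endswith(".0"):
        text = text[:-2]
    # Blank cells and anything non-numeric or too long are unusable.
    if not (text.isascii() and text.isdigit()) or len(text) > 5:
        return None
    # Restore leading zeros lost when a zip code was stored as a number.
    text = text.zfill(5)
    return None if text == "00000" else text


# Illustrative mixed-type column: floats, a blank, a placeholder string, an int.
raw = pd.DataFrame({"zipCode": [95969.0, 0.0, "", "00000", "95928", 2134]})
raw["zipCode"] = raw["zipCode"].map(clean_zip)

# Keep only the rows with a usable zip code.
usable = raw.dropna(subset=["zipCode"])
```

Storing zip codes as strings rather than numbers keeps leading zeros intact and gives the column a single datatype.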
Our final three dataframes are 'df', 'agg_df', and 'zip_df', which hold all records, data aggregated by disaster, and data aggregated by zip code, respectively.
As mentioned, the home owner/renter data held many features concerning the elements of FEMA's payout for any particular disaster, zip code, and city, while the summary data held features concerning the nature of the disaster (such as type, name, start date, etc.). There is also a third category of features that was created for further analysis. This third category included:
- disasterLength: length of the disaster as a function of endDate and startDate; this feature will often be referred to as "Length"
- zip_count: number of unique zip codes that were affected by any particular disaster; a feature of our aggregated-by-disaster dataframe; this feature will often be referred to as "span" or "breadth"
- dis_freq: number of disasters from the past 20 years (the span of the data) that any one zip code may have been a part of; a feature of our aggregated-by-zip-code dataframe
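The three engineered features above could be computed along these lines. This is a sketch on made-up rows: it assumes disasterLength is measured in whole days and that both aggregated dataframes sum totalApprovedIhpAmount, neither of which is confirmed by the notebook.

```python
import pandas as pd

# Made-up records; note the misspelled city "ALPAH" sharing a zip code with "ALPHA".
df = pd.DataFrame({
    "disasterNumber": [1, 1, 1, 2],
    "zipCode": ["11111", "11111", "22222", "11111"],
    "city": ["ALPHA", "ALPAH", "BETA", "ALPHA"],
    "startDate": ["2020-01-01", "2020-01-01", "2020-01-01", "2020-03-01"],
    "endDate": ["2020-01-11", "2020-01-11", "2020-01-11", "2020-03-04"],
    "totalApprovedIhpAmount": [100.0, 50.0, 25.0, 10.0],
})

# disasterLength: days between the start and end of the disaster (assumed unit).
df["startDate"] = pd.to_datetime(df["startDate"])
df["endDate"] = pd.to_datetime(df["endDate"])
df["disasterLength"] = (df["endDate"] - df["startDate"]).dt.days

# agg_df: one row per disaster; zip_count is the number of distinct zip codes hit.
agg_df = df.groupby("disasterNumber").agg(
    disasterLength=("disasterLength", "first"),
    zip_count=("zipCode", "nunique"),
    totalApprovedIhpAmount=("totalApprovedIhpAmount", "sum"),
)

# zip_df: one row per zip code; dis_freq is the number of distinct disasters seen.
zip_df = df.groupby("zipCode").agg(
    dis_freq=("disasterNumber", "nunique"),
    totalApprovedIhpAmount=("totalApprovedIhpAmount", "sum"),
)
```

Because neither aggregation groups on city, the rows for "ALPHA" and "ALPAH" fold into the same totals, which is how aggregating sidesteps the misspelled city names.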
Many of the features are self-explanatory, such as city or state. The less clear features are listed below:
- disasterNumber: Sequentially assigned number used to designate an event or incident declared as a disaster
- validRegistrations: Count of FEMA registration owners within the state, county, zip where the registration is valid. In order to be a valid registration the applicant must be in an Individual Assistance declared state and county and have registered within the FEMA designated registration period.
- approvedForFemaAssistance: number of FEMA applicants who were approved for FEMA's IHP assistance
- totalApprovedIhpAmount: total amount approved under FEMA's IHP program
- totalMaxGrants: count of valid registrations within the state, county, zip that received the max financial grant ($25000+)
- ihProgramDeclared: denotes whether the Individuals and Households program was declared for this disaster
- iaProgramDeclared: denotes whether the Individual Assistance program was declared for this disaster
- paProgramDeclared: denotes whether the Public Assistance program was declared for this disaster
- hmProgramDeclared: denotes whether the Hazard Mitigation program was declared for this disaster
- disasterType: Two character code that defines if the disaster is a Major Disaster Declaration (DR), Emergency Declaration (EM), Fire Management Assistance Declaration (FM), or Fire Suppression Authorization (FS)
- incidentType: Type of incident such as fire or flood. The incident type will affect the types of assistance available.
- disasterCloseOutDate: date all financial transactions for all programs are completed
The following packages were used to wrangle the data, plot the data, and ultimately apply machine learning strategies to make predictions:
import numpy as np
import pandas as pd
import requests
from urllib.request import urlretrieve
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
import scipy.stats as stats
import sklearn
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.cluster import KMeans
While our clustering efforts here were "technically" successful, and at the very least correct, they do not reveal much. Even on visual inspection, it is not hard to conclude that there aren't distinct groups to divide into clusters in the first place, at least not based on the features we have available.
Our multiple linear regression analysis earns the same sort of assessment, though certainly a stronger one: successful, correct, not entirely revealing. Because so many of our features are, to varying degrees, only known "post-disaster", they aren't helpful for early prediction. However, we were able to select and create some features that could give us confidence in making predictions immediately after disaster strikes. That model gave us an R-squared of 0.562. The model that predicts costs after the disaster ends gives a respectable R-squared of 0.889, which is high enough for us to consider it successful.
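To make the comparison between the two models concrete, here is a minimal sketch that fits an "early" and a "late" model with statsmodels and reads off their R-squared values. The data is synthetic and the feature split is an assumption (zip_count in both models, disasterLength only in the late one, since it requires the end date). The numbers it prints have nothing to do with the 0.562 and 0.889 reported above.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data only; the coefficients and noise level are arbitrary.
rng = np.random.default_rng(0)
n = 200
synthetic = pd.DataFrame({
    "zip_count": rng.integers(1, 100, size=n),
    "disasterLength": rng.integers(1, 60, size=n),
})
synthetic["totalApprovedIhpAmount"] = (
    20000 * synthetic["zip_count"]
    + 15000 * synthetic["disasterLength"]
    + rng.normal(0, 200000, size=n)
)

# "Early" model: only a feature assumed to be available soon after the disaster starts.
early = ols("totalApprovedIhpAmount ~ zip_count", data=synthetic).fit()

# "Late" model: adds disasterLength, which is only known once the disaster has ended.
late = ols("totalApprovedIhpAmount ~ zip_count + disasterLength", data=synthetic).fit()

print(f"early R-squared: {early.rsquared:.3f}")
print(f"late  R-squared: {late.rsquared:.3f}")
```

Since the late model contains every term of the early one, its R-squared on the same data can never be lower. A higher value for the post-disaster model is therefore expected, and the size of the gap is the interesting part.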
New .csv files were created over the course of wrangling this data that would save a lot of time for one trying to recreate my results here. If you are interested in them or in the .ipynb files themselves, feel free to reach out to me at gcoxexcel@gmail.com