
ncdc_storm_events

ncdc_storm_events is a project that downloads the NCDC Storm Events database from the National Oceanic and Atmospheric Administration.

The NCDC Storm Events Database is a collection of observations of significant weather events. The "database" contains information on property damage, loss of life, intensity of systems, and more.

The dataset is updated somewhat regularly, but is not real-time. As of this writing, the dataset covers the period January 1950 to August 2018.

There are three tables within the database:

  • details

  • fatalities

  • locations

The details dataset is the heaviest of the raw datasets, weighing over 1.1G when combined and saved as CSV. locations is much smaller at only 75M (CSV), and fatalities is a featherweight at 1.2K.

details also happens to be the dirtiest. Of its 51 columns, at least a dozen are unnecessary, mostly related to date or time observations (there are 13 date/time variables in total, all redundant).

fatalities has the same issue with dates and times as details, but not nearly on the same scale.

locations, though treated as its own dataset, is very similar to the location data within details, and much of its information is redundant.

The entire database, in my opinion, is a great project for exploring, tidying, and cleaning somewhat large datasets. The challenge lies in ensuring that data you remove (what appears to be redundant) really should be removed.

Data Source

The datasets are broken down by table, then further broken down by year of the observations. They are stored in csv.gz format on the NCDC NOAA FTP server.

Each file is named like:

StormEvents_{TABLE}-ftp_v1.0_d{YEAR}_c{LAST_MODIFIED}.csv.gz

where TABLE is one of details, locations, or fatalities, YEAR is the year of the observations, and LAST_MODIFIED is the datetime at which the archive file was last modified.
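
For illustration, the pieces of a filename can be pulled apart with a regular expression. The helper below is a hypothetical sketch, not part of this repository's code, and the example filename is made up to match the pattern.

    # Extract TABLE, YEAR, and LAST_MODIFIED from a Storm Events archive filename
    parse_storm_filename <- function(filename) {
      pattern <- "StormEvents_(details|fatalities|locations)-ftp_v1\\.0_d(\\d{4})_c(\\d+)\\.csv\\.gz"
      parts <- regmatches(filename, regexec(pattern, filename))[[1]]
      list(table = parts[2], year = as.integer(parts[3]), last_modified = parts[4])
    }

    parse_storm_filename("StormEvents_details-ftp_v1.0_d2018_c20181017.csv.gz")
    #> $table "details"  $year 2018  $last_modified "20181017"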

Downloading Data

Files are downloaded with ./code/01_get_data.R. All csv.gz datasets are bound together into one dataframe for each of the three tables. These raw data files are saved in the data directory.
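
A rough sketch of this step, assuming the tidyverse packages listed under Prerequisites. The URL, filenames, and output paths here are illustrative assumptions, not the exact contents of 01_get_data.R.

    library(tidyverse)

    # Assumed location of the csv.gz archives (the NCDC NOAA FTP server)
    base_url <- "ftp://ftp.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles"

    # Download one yearly archive and read it; read_csv decompresses local .gz files
    read_year <- function(filename) {
      dest <- file.path("data", filename)
      download.file(file.path(base_url, filename), dest, mode = "wb")
      read_csv(dest)
    }

    # Example with a single (hypothetical) details file; the real script loops
    # over every year and repeats the process for fatalities and locations
    details_files <- c("StormEvents_details-ftp_v1.0_d2018_c20181017.csv.gz")
    details_raw   <- map_dfr(details_files, read_year)
    write_csv(details_raw, "data/details_raw.csv")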

Tidying Data

All three datasets can be tidied to some extent using ./code/02_tidy_data.R. The details dataset is the worst, with over a dozen date and/or time variables. These variables are dropped, and BEGIN_DATE_TIME and END_DATE_TIME are reformatted to YYYY-MM-DD HH:MM:SS as character strings. Timezone information is not saved; though it is included in the raw dataset, it is almost entirely incorrect.
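
A minimal sketch of the date/time reformatting, assuming the raw values look like "28-APR-12 14:30:00"; the lubridate call and the set of dropped columns are assumptions, not the exact code in 02_tidy_data.R.

    library(tidyverse)
    library(lubridate)

    details_tidy <- details_raw %>%
      mutate(
        BEGIN_DATE_TIME = format(dmy_hms(BEGIN_DATE_TIME), "%Y-%m-%d %H:%M:%S"),
        END_DATE_TIME   = format(dmy_hms(END_DATE_TIME),   "%Y-%m-%d %H:%M:%S")
      ) %>%
      # drop the redundant date/time columns (illustrative selection)
      select(-BEGIN_YEARMONTH, -BEGIN_DAY, -BEGIN_TIME,
             -END_YEARMONTH, -END_DAY, -END_TIME,
             -YEAR, -MONTH_NAME, -CZ_TIMEZONE)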

Date and time variables in fatalities and locations are also modified to remove redundancy; in the case of locations, whose date/time values match those in details, they have been removed entirely.

The damage variables in details have also been modified. The raw data uses alphanumeric characters; for example, "2k" or "2K" for $2,000 and "10B" or "10b" for $10,000,000,000. These have been cleaned to integer values.
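
One way to do this conversion is sketched below; the helper function is hypothetical and not necessarily how 02_tidy_data.R implements it.

    library(tidyverse)

    # Convert strings like "2K", "2k", or "10B" into numeric dollar amounts
    parse_damage <- function(x) {
      mult   <- c(k = 1e3, m = 1e6, b = 1e9)
      suffix <- tolower(str_sub(x, -1))
      number <- suppressWarnings(as.numeric(str_sub(x, 1, -2)))
      if_else(suffix %in% names(mult),
              number * unname(mult[suffix]),
              suppressWarnings(as.numeric(x)))
    }

    parse_damage(c("2K", "10b", "0.5M"))
    #> [1] 2e+03 1e+10 5e+05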

Lastly, the narrative variables (EPISODE_NARRATIVE and EVENT_NARRATIVE) are split out into their own dataset to avoid redundancy and reduce the size of the other datasets.
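
For example, something along these lines (a sketch; using EVENT_ID as the key is an assumption about the actual script):

    # Move the long narrative text into its own table, keyed by event,
    # then drop those columns from the main details table
    details_narratives <- details_tidy %>%
      select(EVENT_ID, EPISODE_NARRATIVE, EVENT_NARRATIVE)

    details_tidy <- details_tidy %>%
      select(-EPISODE_NARRATIVE, -EVENT_NARRATIVE)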

Whereas the raw data is well over 1.2G, the entire tidied dataset sits at 539M.

All tidied datasets are located in the output directory.

Removing data file history from git

When updating data, this repo will use BFG Repo-Cleaner to remove the history of data files from the repository. When this is done, the repository will need to be removed from production environments in favor of a fresh clone.

Getting Started

Prerequisites

Required Packages

  • DT 0.5

  • kableExtra 1.0.1

  • maps 3.3.0

  • mapproj 1.2.6

  • rnaturalearth 0.1.0

  • tidyverse 1.2.1

  • viridis 0.5.1

  • workflowr 0.2.0
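
On a fresh R installation, the packages can be installed from CRAN with something like the following (the versions installed may differ from those pinned above):

    install.packages(c("DT", "kableExtra", "maps", "mapproj",
                       "rnaturalearth", "tidyverse", "viridis", "workflowr"))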

Built With

  • R 3.5.2 - The R Project for Statistical Computing

Contributing

Please read Contributing for details on the code of conduct.

Authors

License

GNU GENERAL PUBLIC LICENSE