Data organization overhaul

Question


Closed this issue 4 years ago · 0 comments

Background

There is room to improve the structure of the /data folder, as one part of a larger effort to organize the project for a more effective workflow. GitHub also blocks any single file larger than 100 MB and recommends keeping repositories under 1 GB.
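The per-file limit can be checked locally before anything is committed. The following is a minimal sketch, not an existing project script: the function name `find_oversized_files` is hypothetical, and the threshold assumes the 100 MB limit means 100 × 1024 × 1024 bytes.

```python
from pathlib import Path

# Assumption: the per-file limit is treated as 100 * 1024 * 1024 bytes.
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024


def find_oversized_files(root, limit_bytes=GITHUB_FILE_LIMIT_BYTES):
    """Return (path, size_in_bytes) pairs for files under root that exceed
    limit_bytes, sorted from largest to smallest."""
    root = Path(root)
    oversized = []
    for path in root.rglob("*"):
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                oversized.append((path, size))
    return sorted(oversized, key=lambda item: item[1], reverse=True)


# Example call (assumes a local "data" folder exists):
# for path, size in find_oversized_files("data"):
#     print(f"{path}: {size / 1024 / 1024:.1f} MiB")
```

Run against the data folder, this would flag a file like the 190 MB SDWIS.zip as something that has to be split or hosted elsewhere.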

Issues

At the moment, we tell people to download a 190 MB file called SDWIS.zip from the Slack. There is no clean way to store this on GitHub other than to split it into multiple zip files. Additionally, other data, such as the UCMR data, is not included.
It's not clear what simple_time_based_model.zip is. If this is derived from a Jupyter notebook analysis, it doesn't need to be zipped for people to download.
While it is clear what UCMR3_All.zip is, it's not clear where it should go, and it's odd to set up a data folder by unzipping files from two different sources; it would make sense for the whole folder to just be in one place, such as the Slack.
There is no layout for what a final /data folder should look like.
We are not using GitHub's branching functionality to its fullest extent. Branches would let people experiment with unclean data without polluting the master branch.

Solution(s)

Create a sources.md file. The most important prerequisite for a data folder is that it should be replicable. There should be a single document, say docs/data/sources.md, that walks through the steps of recreating the folder. It doesn't need to be complicated; it just needs to be documented. For example, a file called foo_data.csv might have been created by running a Python script myscript.py that scrapes a website, and another file called bar_data.csv might have been downloaded directly from a link that sources.md provides. That level of detail is sufficient. The point is that people know where every file comes from.
Create a README in the /data folder. This should tell people how to populate the data folder, and should link to additional documentation in /docs on what is in the data folder.
Separate stored data from recreated data. Some data should be stored and other data should be recreated on an as-needed basis, and there should be a clear delineation between the two. For example, the SDWIS data should be stored on people's machines because it is large, integral to the project, and cumbersome to recreate manually. By contrast, a Census series can be imported from the API with a couple of lines of Python code. There are also various Census series that we'd want to play around with, and it's unclear which of them would be incorporated into a final model. This data probably should not be stored; instead, the functionality for retrieving it should be streamlined.
Centralize the data setup process. At the moment, setup consists only of downloading the repo and downloading SDWIS.zip. Part of the issue is that it's not clear what should be centralized. Once this is figured out, we can centralize everything to make onboarding easier.
Keep only important or clean data on the master branch. Making more use of branching could make project management a lot easier for everyone. It would let people freely work with unclean data and half-done work without impeding others' ability to work off a clean project branch (i.e. master).
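The sources.md and setup ideas above could be backed by a small check that reports which expected files are missing from /data and where each one comes from. This is only a sketch under stated assumptions: the `MANIFEST` dictionary and the `missing_data_files` function are hypothetical, the file names are the illustrative ones used in this issue, and the origin notes are placeholders for whatever sources.md ends up recording.

```python
from pathlib import Path

# Hypothetical manifest mirroring what docs/data/sources.md might record.
# File names are the examples from this issue; origin notes are placeholders.
MANIFEST = {
    "foo_data.csv": "generated by running myscript.py",
    "bar_data.csv": "downloaded directly from the link given in sources.md",
}


def missing_data_files(data_dir, manifest=MANIFEST):
    """Return {file_name: origin_note} for every manifest entry that is
    not present as a regular file inside data_dir."""
    data_dir = Path(data_dir)
    return {
        name: origin
        for name, origin in manifest.items()
        if not (data_dir / name).is_file()
    }


# Example call (assumes the script is run from the repository root):
# for name, origin in missing_data_files("data").items():
#     print(f"missing {name}: {origin}")
```

A new contributor could run a check like this after setup to see what still needs to be downloaded or regenerated, which keeps the README in /data and sources.md honest.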