data-pipeline: An R repository from JohnField

Campaign Lab Data Pipeline

What?

We want to give our dataset a consistent structure (see "Campaign Lab Data Inventory").
To do this, we first need to define the structure (schema) of each data source.
This will later help us build modules that transform our raw data into our target data, which can then be exported to a database, an R package, or any other tool that uses the data in a highly structured and annotated format.

How can I contribute?

We need to go through each of the data sources listed in "Campaign Lab Data Inventory" and create a transformer (in the transformers folder) and an associated schema for each one.
The transformer should run locally on a contributor's machine, downloading the data and transforming it into a CSV (which can later be imported into a local database).
To contribute:
1. Open an Issue named description-rowIdentifier, where description and rowIdentifier are taken from the dataset's entry in the Excel spreadsheet "Campaign Lab Data Inventory".
2. Write a short description of the dataset you are transforming and creating a schema for.
3. Open a Pull Request (from a branch with an appropriate name) when you're finished.
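
As a rough illustration of the shape a transformer could take, here is a minimal sketch in Python. The language, the URL, the raw column names (County, Party, Votes) and all function names are assumptions made for this example; they are not taken from the repository or the data inventory.

```python
import csv
import io
import urllib.request

# Hypothetical values for illustration only; not taken from the data inventory.
SOURCE_URL = "https://example.org/raw-election-results.csv"
TARGET_FIELDS = ["county", "party", "number_of_votes"]


def download(url: str) -> str:
    """Fetch the raw file and return its contents as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def transform(raw_text: str) -> list[dict]:
    """Map raw rows (assumed columns: County, Party, Votes) onto the target fields."""
    rows = []
    for raw in csv.DictReader(io.StringIO(raw_text)):
        rows.append({
            "county": raw["County"].strip(),
            "party": raw["Party"].strip(),
            "number_of_votes": int(raw["Votes"]),
        })
    return rows


def to_csv(rows: list[dict]) -> str:
    """Serialise the transformed rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=TARGET_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def run(url: str, output_path: str) -> None:
    """Download, transform, and write the result to a local CSV file."""
    with open(output_path, "w", newline="") as out:
        out.write(to_csv(transform(download(url))))
```

Keeping the download, transform and write steps in separate functions means the transform step can be checked against a small in-memory sample without touching the network.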

Formatting

We need to make sure that similar fields are formatted the same way across data sources.
For now, the following standard applies:
Timestamp fields: 2015-06-30T22:30:00.000Z (ISO 8601, UTC, millisecond precision)
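
One possible way to produce timestamps in this format, sketched in Python with only the standard library. The function name is hypothetical, and treating naive datetimes as UTC is an assumption of this sketch rather than a project rule.

```python
from datetime import datetime, timedelta, timezone


def to_standard_timestamp(value: datetime) -> str:
    """Render a datetime as UTC with millisecond precision and a trailing Z."""
    if value.tzinfo is None:
        # Assumption: datetimes without timezone information are already in UTC.
        value = value.replace(tzinfo=timezone.utc)
    utc_value = value.astimezone(timezone.utc)
    return utc_value.isoformat(timespec="milliseconds").replace("+00:00", "Z")


# A local time one hour ahead of UTC is converted before formatting.
local = datetime(2015, 6, 30, 23, 30, tzinfo=timezone(timedelta(hours=1)))
print(to_standard_timestamp(local))  # 2015-06-30T22:30:00.000Z
```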

What is a schema?

A schema in this case is simply a JSON (JavaScript Object Notation) document that describes the structure and format of the dataset.
An example schema would be:

{
  "title": "Election results",
  "source": "https://data.police.uk/docs/method/forces/",
  "description": "A dataset of election results",
  "properties": {
    "county": {
      "type": ["string"],
      "description": "The county in which the result was"
    },
    "number_of_votes": {
      "type": ["integer"],
      "description": "The number of votes that were received"
    },
    "party": {
      "type": ["string"],
      "description": "the party which was receiving votes"
    }
  }
}

title is the name of the dataset (you can make this up).
source is a link (if available) to the actual dataset.
description is a one-line summary of the dataset.
properties lists the fields we want to end up with after transforming the raw dataset, each with a type and a description.
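
To show how such a schema might be used, the sketch below checks one transformed row against the properties of the example schema. It is written in Python with only the standard library; check_row and the mapping from schema type names to Python types are hypothetical helpers for this illustration, not part of the repository.

```python
import json

# A trimmed copy of the example schema, embedded so the sketch is self-contained.
SCHEMA = json.loads("""
{
  "title": "Election results",
  "properties": {
    "county": {"type": ["string"]},
    "number_of_votes": {"type": ["integer"]},
    "party": {"type": ["string"]}
  }
}
""")

# Assumed mapping from schema type names to Python types.
PYTHON_TYPES = {"string": str, "integer": int}


def check_row(row: dict, schema: dict) -> list[str]:
    """Return a list of problems found in one transformed row (empty if none)."""
    problems = []
    for name, spec in schema["properties"].items():
        if name not in row:
            problems.append(f"missing field: {name}")
            continue
        allowed = tuple(PYTHON_TYPES[t] for t in spec["type"])
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(row[name], bool) or not isinstance(row[name], allowed):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    return problems
```

A row such as {"county": "Kent", "number_of_votes": 120, "party": "Labour"} yields an empty list, while a row with a vote count stored as text or a missing party is reported.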
