This repository is a starting point for developers within Open Data Services Cooperative to either pickup and run with, or use as a verbose example of how not to solve this problem.
The problem has been outlined by @timgdavies in a gist which I've reproduced below. All I've done so far is create a skeletal outline of how I'd approach this, and funtionality for the '*' wildcard column headings. The [key:value] constants haven't been worked through yet, and I've peppered the code with # TODO:
comments where I think there is more functionality needed (for instance, validation of workbooks before attempting to explode their special columns.
Any questions, let me know. I'm going to assume that this repo won't be needed for more than a few weeks, so once the relevant code has been copied over (or pointed and laughted at), I'll remove it from GH.
This document contains preliminary remarks on modelling indicators in spreadsheets and schema to support the work of the Global Coffee Platform and Social Economy Data Lab project.
We want to capture data about indicators.
An indicator consists of a measure and a set of dimensions.
Indicators attach to entities like organisations, localities or investments.
Indicators may have particular data types or validation rules related to them.
Indicators need to be clearly and accessiblity documented
An indicator might be "Number of staff".
We might state that Open Data Services Co-op has 14 staff.
A spreadsheet about organisations and their numbers of staff might represent this as:
Organisation | Number of staff |
---|---|
Open Data Services Co-op | 14 |
To make this work with flatten tool, we might expand this to:
org/name | org/id | org/indicators/0/value |
---|---|---|
Open Data Services Co-op | GB-COH-09506232 | 14 |
But, this leaves quite a lot of information implicit.
- What period does this measure cover?
- How is the measure defined?
The JSON representation of the above would be:
{
"org": {
"id":"GB-COH-09506232",
"name":"Open Data Services Co-op",
"indicators":[
{
"value":14
}
]
}
}
But we really need something at least like:
{
"org": {
"id":"GB-COH-09506232",
"name":"Open Data Services Co-op",
"indicators":[
{
"indicatorCode":"FTE",
"value":14,
"date":"2018-04-19"
}
]
}
}
Which would be OK in a spreadsheet where each indicator is it's own row, and we could have:
org/name | org/id | org/indicators/0/code | org/indicators/0/date | org/indicators/0/value |
---|---|---|---|---|
Open Data Services Co-op | GB-COH-09506232 | FTE | 2018-04-19 | 14 |
But in many of the spreadsheets we've been, users prefer to lay information out with a metric in each column, such as:
Organisation Name | Date measured | FTE Staff | Part-time staff (total) | Volunteers |
---|---|---|---|---|
Open Data Services Co-op | 2018-04-09 | 14 | 6 | 0 |
It would be good if we can maintain structures like this, whilst marking them up to generate clear structured data.
In doing so we have to be aware that:
- Users often create filters or sort spreadsheets, making hidden rows and positional markup difficult to maintain
We could create a pre-processor for flatten-tool which expands out the additional contextual columns for each indicator without requiring that the columns are included in the original.
This would support two additional syntaxes.
(1) Adding a constant value to every row
When [property:value] is included on the end of a field path in the header row, an additional column must be created that has a field path ending in 'property' and the value filled down in every row in which the source column has a value.
For example:
org/id | org/indicators/0/value[code:FTE] | org/indicators/1/value[code:Volunteers] |
---|---|---|
GB-COH-09506232 | 14 | 0 |
GB-COH-05381958 | 2 |
Would become:
org/id | org/indicators/0/code | org/indicators/0/value | org/indicators/1/code | org/indicators/1/value[code:Volunteers] |
---|---|---|---|---|
GB-COH-09506232 | FTE | 14 | Volunteers | 0 |
GB-COH-05381958 | Volunteers | 2 |
(Note the two new columns created, and filled where their paired column had a value).
(2) Adding a constant value to every column
When a column path exists with a wildcard in place of an array integer, a column must be created for each following integer that could be substituted in for the wildcard, processing across the table until the final such column OR until the same wildcard column is encountered again.
For example:
org/id | org/indicators/*/year | org/indicators/0/value | org/indicators/1/value |
---|---|---|---|
GB-COH-09506232 | 2017 | 14 | 0 |
GB-COH-05381958 | 2018 | 2 |
Would become:
org/id | org/indicators/0/year | org/indicators/0/value | org/indicators/1/year | org/indicators/1/value |
---|---|---|---|---|
GB-COH-09506232 | 2017 | 14 | 2017 | 0 |
GB-COH-05381958 | 2018 | 2 | 2018 |
(Note the year column, now given for each wildcard field)
The rule about only applying this expansion until such time as a second wildcard column is found is illustrated below:
org/id | org/indicators/*/year | org/indicators/0/value | org/indicators/1/value | org/indicators/*/year | org/indicators/2/value | org/indicators/*/year | org/indicators/3/value |
---|---|---|---|---|---|---|---|
GB-COH-09506232 | 2017 | 14 | 0 | - | 5 | 2015 | 1 |
GB-COH-05381958 | 2018 | 2 | - | 5 | 2016 | 2 |
Which would expand to:
org/id | org/indicators/0/year | org/indicators/0/value | org/indicators/1/year | org/indicators/1/value | org/indicators/2/value | org/indicators/3/year | org/indicators/3/value |
---|---|---|---|---|---|---|---|
GB-COH-09506232 | 2017 | 14 | 2017 | 0 | 5 | 2015 | 1 |
GB-COH-05381958 | 2018 | 2 | 2018 | 5 | 2016 | 2 |
Note here that the use of a blank (shown with '-') indicators/*/year
column after indicators/1/value
means no year column has been created for the indicators/2/value
, but then the next use of indicators/*/year
sets a new year for the columns that follow.
Putting it together
Combined, these two expansions should make it possible to generate rich data from terse spreadsheets that match how data is often organised. There are similarities here to the approach taken by HXL, but rather than indicating the concepts represented by columns, we are looking to support conversion into tree structured data.
For example, the table below:
org/name | org/id | org/indicators/*/year | org/indicators/0/value[code:FTE] | org/indicators/1/value[code:Volunteers] | org/indicators/2/value[code:Turnover] |
---|---|---|---|---|---|
Organisation Name | Organisation ID | Year | Employees (FTE) | Volunteers | Turnover |
Open Data Services Co-operative | GB-COH-09506232 | 2018 | 12 | 0 | 500000 |
Practical Participation Ltd | GB-COH-05381958 | 2017 | 2 | 1 | 100000 |
should expand to:
[
{
"org": {
"name": "Open Data Services Co-operative",
"id": "GB-COH-09506232",
"indicators": [
{
"year": "2018",
"code": "FTE",
"value": "12"
},
{
"code": "Volunteers",
"year": "2018",
"value": "0"
},
{
"year": "2018",
"code": "Turnover",
"value": "500000"
}
]
}
},
{
"org": {
"name": "Practical Participation Ltd",
"id": "GB-COH-05381958",
"indicators": [
{
"year": "2017",
"code": "FTE",
"value": "2"
},
{
"code": "Volunteers",
"year": "2017",
"value": "1"
},
{
"year": "2017",
"code": "Turnover",
"value": "100000"
}
]
}
}
]
The second challenge we face relates to indicator validation and documentation.
For example, we might know that:
- FTE Staff indicators should be given by a number, and should have period value;
- A turnover indicaitor should be a number and that it should also have a currency value specified;
We therefore need a way to:
- Represent;
- Document; and
- Validate
requirements for indicators?
JSON Schema offers us an approach to this, allowing us to create a small block of schema for each indicator. For example:
FTE Employees Indicator
{
"type":"object",
"title":"Employees (FTE)",
"description":"Number of employees, full-time equivalent.",
"properties":{
"code":{
"type":"string",
"enum":["FTE"]
},
"date":{
"type":"string",
"title":"Period",
"description":"The date may be provided as YYYY, YYYY-MM, or YYYY-MM-DD"
},
"value":{
"type":"number"
}
}
}
Turnover indicator
{
"type":"object",
"title":"Turnover",
"description":"Turnover in the calendar year starting on the given date.",
"properties":{
"code":{
"type":"string",
"enum":["Turnover"]
},
"date":{
"type":"string",
"title":"Start date of 12 month period",
"description":"The date may be provided as YYYY, YYYY-MM, or YYYY-MM-DD"
},
"value":{
"type":"number"
},
"currency":{
"type":"string",
"title":"Currency",
"description":"An ISO Currency String",
"enum":["USD","GBP","EUR"]
},
"required":["code","value","currency","date"]
}
}
These can be combined into an overall JSON schema using the anyOf
property.
Adam Locker at the Food Standards Agency has written a useful primer on documenting metrics