Treasury Datapackage Pipelines

Pipelines used to prepare data for upload to OpenSpending Packager while we're not ready to use os-data-importers but need datapackage-pipelines

Setup

If not using docker, install Python dependencies:

Set up and activate a python 3 virtual environment.

Install the dependencies in the virtual environment using pip install -r requirements.txt

Running pipelines

List available pipelines with

dpp

or using docker

docker run --rm -it -v `pwd`:/pipelines:rw frictionlessdata/datapackage-pipelines

run a pipeline with

DPP_PROCESSOR_PATH=$PWD dpp run --verbose ./2018-19/national/aene/aene-2018-19

or using docker

docker run --rm -it -e "DPP_PROCESSOR_PATH=/pipelines" -v `pwd`:/pipelines:rw frictionlessdata/datapackage-pipelines run --verbose ./2018-19/national/aene/aene-2018-19

Note important options:

environment variable DPP_PROCESSOR_PATH - help it find out processors
argument --verbose: Actually give is some output so we know where things break

Unique dimensions

OpenSpending relies on each row ignoring amounts having a unique set of dimension values.

In database terms, the composite primary key for each row, made up of each of the classification columns, must be unique.

To check if your dataset has unique rows or needs additional processing to make it unique, use csvkit. Install it in a different python virtualenv from datapackage-pipelines.

First list the fields to get their indexes:

csvcut -n 2017-18/national/processed/aene-2017-18.csv

Then count the number of rows for each combination of classifying fields by selecting all fields except the amount field, and counting the duplicate rows. If there are duplicate fields, the last rows in the output of the following command would be more than 1:

csvcut -C 14 2017-18/national/processed/aene-2017-18.csv | sort | uniq -c| sort -n

e.g.

14 24,"Agriculture, Forestry and Fisheries",4,Trade Promotion and Market Access,2,International Relations and Trade,Current,Transfers and subsidies,Foreign governments and international organisations,Foreign governments and international organisations,Foreign governments and international organisations,2017,Total,Adjusted appropriation
14 24,"Agriculture, Forestry and Fisheries",4,Trade Promotion and Market Access,2,International Relations and Trade,Current,Transfers and subsidies,Foreign governments and international organisations,Foreign governments and international organisations,Foreign governments and international organisations,2017,Total,Voted (Main appropriation)
24 6,International Relations and Cooperation,5,International Transfers,2,Membership contribution,Current,Transfers and subsidies,Foreign governments and international organisations,Foreign governments and international organisations,Foreign governments and international organisations,2017,Total,Adjusted appropriation
24 6,International Relations and Cooperation,5,International Transfers,2,Membership contribution,Current,Transfers and subsidies,Foreign governments and international organisations,Foreign governments and international organisations,Foreign governments and international organisations,2017,Total,Voted (Main appropriation)

In this case you should verify that the kind of duplication that's happening can be solved by summing all the duplicates, and adding the join processor's deduplication mode.

Troubleshooting:

dpp output like this ('NoneType' object has no attribute 'startswith' in particular) sometimes means a spec section is indented differently to the rest. This often happens when copying between specs.

- ./2016-17/provincial/are/are-2016-17
- ./2014-15/national/ene/ene-2014-15 (*)
- ./budget-vs-actual/national/budget-vs-actual-national (*)
Traceback (most recent call last):
  File "/home/jdb/projects/vulekamali/treasury-pipelines/env/bin/dpp", line 11, in <module>
    load_entry_point('datapackage-pipelines==2.0.0', 'console_scripts', 'dpp')()
  File "/home/jdb/projects/vulekamali/treasury-pipelines/env/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/jdb/projects/vulekamali/treasury-pipelines/env/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/jdb/projects/vulekamali/treasury-pipelines/env/lib/python3.7/site-packages/click/core.py", line 1114, in invoke
    return Command.invoke(self, ctx)
  File "/home/jdb/projects/vulekamali/treasury-pipelines/env/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/jdb/projects/vulekamali/treasury-pipelines/env/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/jdb/projects/vulekamali/treasury-pipelines/env/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/jdb/projects/vulekamali/treasury-pipelines/env/lib/python3.7/site-packages/datapackage_pipelines/cli.py", line 20, in cli
    for spec in pipelines():  # type: PipelineSpec
  File "/home/jdb/projects/vulekamali/treasury-pipelines/env/lib/python3.7/site-packages/datapackage_pipelines/specs/specs.py", line 73, in pipelines
    for prefix in prefixes):
  File "/home/jdb/projects/vulekamali/treasury-pipelines/env/lib/python3.7/site-packages/datapackage_pipelines/specs/specs.py", line 73, in <genexpr>
    for prefix in prefixes):
AttributeError: 'NoneType' object has no attribute 'startswith'