A prototype web service for validating and collecting participant files from eligible training providers
This package uses a suite of table validation tools based upon Good Tables.
The heart of this package is the JSON Table Schema that describes the required fields and their constraints. A draft specification can be found in `etp-uploader-schema.json`.
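A minimal sketch of what such a schema might look like. The field names and constraints below are illustrative only, not the actual ETP schema:

```json
{
  "fields": [
    {
      "name": "participant_id",
      "type": "string",
      "description": "Unique identifier for the participant",
      "constraints": {"required": true, "unique": true}
    },
    {
      "name": "enrollment_date",
      "type": "date",
      "description": "Date the participant enrolled, in YYYY-MM-DD format",
      "constraints": {"required": true}
    }
  ]
}
```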
TODO: It would be nice to use the `description` field when displaying validation errors, to give users more effective feedback when their data file isn't valid.
The ETP Uploader package uses Frictionless Data's Goodtables Python package to do its heavy lifting. That library recently made breaking API changes in its 1.0 release, but this package still uses the older 0.7.x API. Migrating to the newer API would be beneficial in the future, but for now we use a custom fork that provides ongoing support for the 0.7.x API.
Currently runs on Python 3.6.1. There is some historic support for 2.7, but development is conducted on Python 3.6+.
- Clone the repository into a virtual environment
- Install dependencies:

```shell
pip install -r requirements.txt
```

- Run a server:

```shell
python main.py
```

- Run the tests:

```shell
pip install -r requirements/test.txt && pip install -r requirements/local.txt
./test.sh
```
- A web form for manually adding data for validation
- Via XHR, a JSON object with an `endpoints` property that describes the available endpoints
- Via browser, a documentation page for the API
- POST to validate data
- GET to validate data
The API and UI support a subset of the parameters available in a Good Tables pipeline. All possible arguments to a pipeline and its individual processors can be found in the Good Tables docs.
`data`: (required) Any file, URL to a file, or string of data

`schema`: (default: None) A convenience for the `options['schema']['schema']` argument that is passed to the schema validator

`report_limit`: (default: 1000, max: 1000) An integer that limits the number of report results a validator can generate. Validation will cease once this amount is reached.

`row_limit`: (default: 20000, max: 30000) An integer that limits the number of rows that will be processed. Iteration over the data will stop at this amount.

`fail_fast`: (default: True) A boolean that sets whether the run fails on the first error, or not.

`format`: (default: 'csv') 'csv' or 'excel'; the format of the file.

`ignore_empty_rows`: (default: False) A boolean that sets whether empty rows raise errors, or are ignored.

`ignore_duplicate_rows`: (default: False) A boolean that sets whether duplicate rows raise errors, or are ignored.

`encoding`: (default: None) A string that indicates the encoding of the data. Overrides automatic encoding detection.
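As a sketch of how a client might assemble these parameters, the helper below builds a request body and clamps `report_limit` and `row_limit` to their documented maxima. The function name and the clamping behaviour are our own illustration, not part of the API itself:

```python
from urllib.parse import urlencode

# Documented maxima from the parameter list above
MAX_REPORT_LIMIT = 1000
MAX_ROW_LIMIT = 30000

def build_run_params(data, schema=None, report_limit=1000, row_limit=20000,
                     fail_fast=True, format='csv'):
    """Build a URL-encoded parameter string for a validation run.

    `data` may be a URL to a file or a raw string of data; `schema` is
    only included when provided.
    """
    params = {
        'data': data,
        'report_limit': min(report_limit, MAX_REPORT_LIMIT),
        'row_limit': min(row_limit, MAX_ROW_LIMIT),
        'fail_fast': fail_fast,
        'format': format,
    }
    if schema is not None:
        params['schema'] = schema
    return urlencode(params)
```

The resulting string can then be sent as the body of a POST, or appended as a query string for a GET.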
```shell
# make a request
curl http://goodtables.okfnlabs.org/api/run --data "data=https://raw.githubusercontent.com/okfn/goodtables/master/examples/row_limit_structure.csv&schema=https://raw.githubusercontent.com/okfn/goodtables/master/examples/test_schema.json"
```
```
# the response will be like
{
  "report": {
    "summary": {
      "bad_row_count": 1,
      "total_row_count": 10,
      ...
    },
    "results": [
      {
        "result_id": "structure_001",  # the ID of this result type
        "result_level": "error",  # the severity of this result type (info/warning/error)
        "result_message": "Row 1 is defective: there are more cells than headers",  # a message that describes the result
        "result_name": "Defective Row",  # a human-readable title for this result
        "result_context": ["38", "John", "", ""],  # the row values from which this result triggered
        "row_index": 1,  # the index of the row
        "row_name": "",  # if the row has an id field, this is displayed, otherwise empty
        "column_index": 4,  # the index of the column
        "column_name": ""  # the name of the column (the header), if applicable
      },
      ...
    ]
  }
}
```
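A client consuming this response might pull out the error-level results and turn each into a user-facing message. This is a minimal sketch against the response shape shown above; the helper name is our own:

```python
def summarize_errors(response):
    """Return (bad_row_count, messages) for error-level results in a report."""
    report = response['report']
    errors = [r for r in report['results'] if r['result_level'] == 'error']
    messages = ['{result_name}: {result_message}'.format(**r) for r in errors]
    return report['summary']['bad_row_count'], messages

# Example using the sample response from the docs above
response = {
    'report': {
        'summary': {'bad_row_count': 1, 'total_row_count': 10},
        'results': [{
            'result_id': 'structure_001',
            'result_level': 'error',
            'result_message': 'Row 1 is defective: there are more cells than headers',
            'result_name': 'Defective Row',
            'result_context': ['38', 'John', '', ''],
            'row_index': 1,
            'row_name': '',
            'column_index': 4,
            'column_name': '',
        }],
    }
}
bad_rows, messages = summarize_errors(response)
```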
The UI is a simple form for submitting data, with an optional schema, from either URLs or uploaded files.