Goodtables can't handle S3 objects with spaces loaded from datapackage
cschloer commented
Overview
I'm trying to run Goodtables on a datapackage whose resource paths are S3 objects. I consistently get an error when the object key contains spaces.
Example datapackage:
{
'bcodmo:': {
'dataManager': {
'name': '',
'orcid': ''
},
'submissionId': '1814285110495019634'
},
'bytes': 812,
'count_of_rows': 8,
'dump_bucket': 'laminar-results',
'dump_path': '41a4d656-191e-11eb-b364-b587e408effe/',
'hash': 'f002329fe5c4081b7271da7cfee9729e',
'profile': 'data-package',
'resources': [{
'bytes': 812,
'dialect': {
'delimiter': ',',
'doubleQuote': True,
'lineTerminator': '\r\n',
'quoteChar': '"',
'skipInitialSpace': False
},
'dpp:streaming': True,
'encoding': 'utf-8',
'format': 'csv',
'hash': 'dc065f16af38d3d78b41b523e67c07b4',
'mediatype': 'text/csv',
'name': 'new resource',
'path': 's3://laminar-results/41a4d656-191e-11eb-b364-b587e408effe/new resource.csv',
'profile': 'data-resource',
'schema': {
'fields': [{
'format': 'default',
'name': 'col1',
'type': 'string'
}, {
'format': 'default',
'name': 'col2',
'type': 'string'
}, {
'format': 'default',
'name': 'col3',
'type': 'string'
}, {
'format': 'default',
'name': 'col4',
'type': 'string'
}, {
'name': 'path_name',
'type': 'string'
}],
'missingValues': ['', 'nd']
}
}]
}
It also fails if I URL-encode the path (using the tabulator.helpers function 'requote_uri').
It does work if I change the object name to new_resource.csv.
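One possible reason percent-encoding alone does not help: S3 object keys are literal strings, so a key containing a space and the same key written with %20 name two different objects. The sketch below uses only the standard library. `split_s3_url` is a hypothetical helper, not part of goodtables or tabulator, and it does not claim to reproduce how either library parses the path.

```python
from urllib.parse import quote, unquote, urlparse


def split_s3_url(url):
    """Split an s3:// URL into (bucket, key) without decoding anything.

    Hypothetical helper for illustration only.
    """
    parts = urlparse(url, allow_fragments=False)
    return parts.netloc, parts.path.lstrip("/")


raw = "s3://laminar-results/41a4d656-191e-11eb-b364-b587e408effe/new resource.csv"

# Percent-encode the URL, leaving ':' and '/' untouched.
encoded = quote(raw, safe=":/")

bucket, key = split_s3_url(raw)
_, encoded_key = split_s3_url(encoded)

print(key)          # key with a literal space
print(encoded_key)  # key with '%20' in place of the space

# The two keys are different strings, so they would address different objects.
print(key == encoded_key)
# Decoding the encoded key recovers the original one.
print(unquote(encoded_key) == key)
```

If a loader passed the encoded key to GetObject unchanged, S3 would look for an object literally named new%20resource.csv, which would be consistent with a NoSuchKey error. Whether that is what happens here is an assumption.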
cschloer commented
To be clear, the resulting report looks like this:
{
"report": {
"time": 0.843,
"valid": false,
"error-count": 1,
"table-count": 1,
"tables": [
{
"datapackage": "{\"bcodmo:\": {\"dataManager\": {\"name\": \"\", \"orcid\": \"\"}, \"submissionId\": \"1814285110495019634\"}, \"bytes\": 812, \"count_of_rows\": 8, \"dump_bucket\": \"laminar-results\", \"dump_path\": \"41a4d656-191e-11eb-b364-b587e408effe/\", \"hash\": \"f002329fe5c4081b7271da7cfee9729e\", \"profile\": \"data-package\", \"resources\": [{\"bytes\": 812, \"dialect\": {\"delimiter\": \",\", \"doubleQuote\": true, \"lineTerminator\": \"\\r\\n\", \"quoteChar\": \"\\\"\", \"skipInitialSpace\": false}, \"dpp:streaming\": true, \"encoding\": \"utf-8\", \"format\": \"csv\", \"hash\": \"dc065f16af38d3d78b41b523e67c07b4\", \"mediatype\": \"text/csv\", \"name\": \"new resource\", \"path\": \"s3://laminar-results/41a4d656-191e-11eb-b364-b587e408effe/new%20resource.csv\", \"profile\": \"data-resource\", \"schema\": {\"fields\": [{\"format\": \"default\", \"name\": \"col1\", \"type\": \"string\"}, {\"format\": \"default\", \"name\": \"col2\", \"type\": \"string\"}, {\"format\": \"default\", \"name\": \"col3\", \"type\": \"string\"}, {\"format\": \"default\", \"name\": \"col4\", \"type\": \"string\"}, {\"name\": \"path_name\", \"type\": \"string\"}], \"missingValues\": [\"\", \"nd\"]}}]}",
"resource-name": "new resource",
"time": 0.832,
"valid": false,
"error-count": 1,
"row-count": 0,
"source": "s3://laminar-results/41a4d656-191e-11eb-b364-b587e408effe/new resource.csv",
"scheme": "inline",
"format": "inline",
"encoding": "no",
"schema": "table-schema",
"errors": [
{
"code": "source-error",
"message": "An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.",
"message-data": {}
}
]
}
],
"warnings": [],
"preset": "datapackage"
}
}
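In the report above, the path inside the embedded "datapackage" string contains new%20resource.csv, while "source" shows new resource.csv. A minimal sketch for checking that kind of mismatch is below. The `report` dict is a trimmed-down copy that keeps only the fields used here, and `compare_paths` is a hypothetical helper, not a goodtables function.

```python
import json
from urllib.parse import unquote

# Trimmed-down copy of the report: only the fields used below are kept.
report = {
    "report": {
        "tables": [
            {
                "datapackage": json.dumps({
                    "resources": [{
                        "name": "new resource",
                        "path": "s3://laminar-results/41a4d656-191e-11eb-b364-b587e408effe/new%20resource.csv",
                    }]
                }),
                "resource-name": "new resource",
                "source": "s3://laminar-results/41a4d656-191e-11eb-b364-b587e408effe/new resource.csv",
                "errors": [{"code": "source-error"}],
            }
        ]
    }
}


def compare_paths(table):
    """Return (descriptor_path, reported_source) for one table entry.

    Hypothetical helper for illustration only.
    """
    descriptor = json.loads(table["datapackage"])
    paths_by_name = {r["name"]: r["path"] for r in descriptor["resources"]}
    return paths_by_name[table["resource-name"]], table["source"]


descriptor_path, source = compare_paths(report["report"]["tables"][0])

# True when the descriptor path and the reported source are different strings.
paths_differ = descriptor_path != source
# True when the only difference is percent-encoding.
same_after_decoding = unquote(descriptor_path) == source

print(paths_differ, same_after_decoding)
```

This only shows that the two strings differ by percent-encoding. It does not show which of them was actually sent to S3.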
cschloer commented
Sorry, this is actually a goodtables-py issue, not a goodtables-ui one. @roll how do you want me to handle this? I haven't migrated to frictionless-py yet (I need to wait for dataflows to update its dependency), but the goodtables-py repo no longer exists.