frictionlessdata/frictionless-py

Goodtables can't handle S3 objects with spaces loaded from datapackage

Closed this issue · 4 comments

Overview

I'm trying to run Goodtables on a datapackage whose resource paths are S3 objects. I consistently run into errors when the object key contains spaces.

Example datapackage:

{
    'bcodmo:': {
        'dataManager': {
            'name': '',
            'orcid': ''
        },
        'submissionId': '1814285110495019634'
    },
    'bytes': 812,
    'count_of_rows': 8,
    'dump_bucket': 'laminar-results',
    'dump_path': '41a4d656-191e-11eb-b364-b587e408effe/',
    'hash': 'f002329fe5c4081b7271da7cfee9729e',
    'profile': 'data-package',
    'resources': [{
        'bytes': 812,
        'dialect': {
            'delimiter': ',',
            'doubleQuote': True,
            'lineTerminator': '\r\n',
            'quoteChar': '"',
            'skipInitialSpace': False
        },
        'dpp:streaming': True,
        'encoding': 'utf-8',
        'format': 'csv',
        'hash': 'dc065f16af38d3d78b41b523e67c07b4',
        'mediatype': 'text/csv',
        'name': 'new resource',
        'path': 's3://laminar-results/41a4d656-191e-11eb-b364-b587e408effe/new resource.csv',
        'profile': 'data-resource',
        'schema': {
            'fields': [{
                'format': 'default',
                'name': 'col1',
                'type': 'string'
            }, {
                'format': 'default',
                'name': 'col2',
                'type': 'string'
            }, {
                'format': 'default',
                'name': 'col3',
                'type': 'string'
            }, {
                'format': 'default',
                'name': 'col4',
                'type': 'string'
            }, {
                'name': 'path_name',
                'type': 'string'
            }],
            'missingValues': ['', 'nd']
        }
    }]
}

It also does not work if I URL-encode the path (using the 'requote_uri' function from tabulator.helpers).

It does work if I change the object name to new_resource.csv.
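
For anyone reproducing this, here is a minimal sketch of why encoding the path may not help. It uses only the standard library, not tabulator itself; `split_s3_url` is a hypothetical helper written for illustration, and `urllib.parse.quote` stands in for 'requote_uri'. The assumption is that the part of the URL after the bucket is passed to S3 as the object key unchanged. S3 treats keys as literal strings, so a key containing `%20` is a different key from one containing a space.

```python
from urllib.parse import quote, unquote, urlparse


def split_s3_url(url):
    """Hypothetical helper: split an s3:// URL into (bucket, key)."""
    parts = urlparse(url)
    return parts.netloc, parts.path.lstrip("/")


raw = "s3://laminar-results/41a4d656-191e-11eb-b364-b587e408effe/new resource.csv"
# Percent-encode the space while leaving ':' and '/' untouched.
encoded = quote(raw, safe=":/")

bucket, raw_key = split_s3_url(raw)
_, encoded_key = split_s3_url(encoded)

print(bucket)       # bucket name taken from the URL
print(raw_key)      # key as stored in S3, with a literal space
print(encoded_key)  # key with %20, which names a different object

# Decoding the encoded key recovers the stored key.
print(unquote(encoded_key) == raw_key)
```

If the loader sends the encoded key as-is, a NoSuchKey error would be the expected result, since no object with `%20` in its name exists in the bucket.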



To be clear, the resulting report looks like:

{
	"report": {
		"time": 0.843,
		"valid": false,
		"error-count": 1,
		"table-count": 1,
		"tables": [
			{
				"datapackage": "{\"bcodmo:\": {\"dataManager\": {\"name\": \"\", \"orcid\": \"\"}, \"submissionId\": \"1814285110495019634\"}, \"bytes\": 812, \"count_of_rows\": 8, \"dump_bucket\": \"laminar-results\", \"dump_path\": \"41a4d656-191e-11eb-b364-b587e408effe/\", \"hash\": \"f002329fe5c4081b7271da7cfee9729e\", \"profile\": \"data-package\", \"resources\": [{\"bytes\": 812, \"dialect\": {\"delimiter\": \",\", \"doubleQuote\": true, \"lineTerminator\": \"\\r\\n\", \"quoteChar\": \"\\\"\", \"skipInitialSpace\": false}, \"dpp:streaming\": true, \"encoding\": \"utf-8\", \"format\": \"csv\", \"hash\": \"dc065f16af38d3d78b41b523e67c07b4\", \"mediatype\": \"text/csv\", \"name\": \"new resource\", \"path\": \"s3://laminar-results/41a4d656-191e-11eb-b364-b587e408effe/new%20resource.csv\", \"profile\": \"data-resource\", \"schema\": {\"fields\": [{\"format\": \"default\", \"name\": \"col1\", \"type\": \"string\"}, {\"format\": \"default\", \"name\": \"col2\", \"type\": \"string\"}, {\"format\": \"default\", \"name\": \"col3\", \"type\": \"string\"}, {\"format\": \"default\", \"name\": \"col4\", \"type\": \"string\"}, {\"name\": \"path_name\", \"type\": \"string\"}], \"missingValues\": [\"\", \"nd\"]}}]}",
				"resource-name": "new resource",
				"time": 0.832,
				"valid": false,
				"error-count": 1,
				"row-count": 0,
				"source": "s3://laminar-results/41a4d656-191e-11eb-b364-b587e408effe/new resource.csv",
				"scheme": "inline",
				"format": "inline",
				"encoding": "no",
				"schema": "table-schema",
				"errors": [
					{
						"code": "source-error",
						"message": "An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.",
						"message-data": {}
					}
				]
			}
		],
		"warnings": [],
		"preset": "datapackage"
	}
}
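
When checking many resources, it can help to pull just the errors out of a report shaped like the one above. The following is a rough sketch, assuming the report has already been parsed into a Python dict; `summarize_errors` is a hypothetical helper and not part of goodtables, and the sample payload is abbreviated to the fields the helper reads.

```python
def summarize_errors(payload):
    """Return (source, code, message) tuples for every error in the report."""
    report = payload["report"]
    return [
        (table["source"], error["code"], error["message"])
        for table in report["tables"]
        for error in table["errors"]
    ]


# Abbreviated copy of the report above (assumed already parsed from JSON).
payload = {
    "report": {
        "valid": False,
        "error-count": 1,
        "tables": [
            {
                "source": "s3://laminar-results/41a4d656-191e-11eb-b364-b587e408effe/new resource.csv",
                "valid": False,
                "errors": [
                    {
                        "code": "source-error",
                        "message": "An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.",
                        "message-data": {},
                    }
                ],
            }
        ],
    }
}

errors = summarize_errors(payload)
for source, code, message in errors:
    print(f"{code}: {source}\n  {message}")
```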

Sorry, this is actually a goodtables-py repo issue, not goodtables-ui. @roll how do you want me to handle this? I haven't migrated to frictionless-py (need to wait for dataflows to update dependency), but the goodtables-py repo no longer exists.

roll commented

Hi @cschloer,

No worries, I'll handle this one.

In the future, feel free to file all goodtables issues in the frictionless-py issue tracker, as goodtables lives there as a branch.

roll commented

Hi @cschloer,

Could you please try updating to tabulator@1.52.5?