Import records should not have an empty string for `mime_type`
Import records that have an empty string for `source_metadata.mime_type` don’t import correctly, even though we do occasionally create them. For example, a recent import had the following record:
{
  "page_url": "https://www.niehs.nih.gov/sitemap/",
  "uuid": null,
  "capture_time": "2020-08-07T09:15:37Z",
  "uri": "http://web.archive.org/web/20200807091537id_/https://www.niehs.nih.gov/sitemap/",
  "hash": "db278b306b8e6380a2d27ca290b5b25f5e1969da9d9a472aabc26ddf07a6bbe0",
  "source_type": "internet_archive",
  "title": "404 Not Found",
  "source_metadata": {
    "mime_type": "",
    "encoding": null,
    "headers": {
      "Access-Control-Allow-Origin": "*",
      "Strict-Transport-Security": "max-age=31536000;includeSubDomains"
    },
    "view_url": "http://web.archive.org/web/20200807091537/https://www.niehs.nih.gov/sitemap/",
    "error_code": 404
  },
  "status": "404",
  "page_maintainers": [],
  "page_tags": []
}
Which produced the error:
Row 72: Media type must be a media type, like `text/plain`, and *not* include parameters, like `; charset=utf-8`
Instead, we should make sure we set `mime_type` to `None` in this case. That’s probably in `cli.py`:
web-monitoring-processing/web_monitoring/cli/cli.py
Lines 341 to 346 in be2a017
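As a rough illustration of the fix, here is a minimal sketch of turning an empty `mime_type` into `None` before the record is written. The helper name `clean_mime_type` and the way it is applied to a `source_metadata` dict are assumptions made for this example; they are not taken from `cli.py`.

```python
def clean_mime_type(value):
    """Return the MIME type, or None if it is missing or blank.

    Hypothetical helper for illustration; not part of web_monitoring.
    """
    if value is None:
        return None
    # Treat whitespace-only strings the same as empty strings.
    cleaned = value.strip()
    return cleaned or None


# Example: the metadata from the record above, reduced to the relevant keys.
source_metadata = {"mime_type": "", "encoding": None, "error_code": 404}
source_metadata["mime_type"] = clean_mime_type(source_metadata["mime_type"])
```

With this in place, the record above would carry `null` rather than `""` for `mime_type` once serialized to JSON.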
Relatedly, we should probably set `media_type` and `media_type_parameters` instead of `source_metadata.mime_type` and `source_metadata.encoding` on the import records. See how imports are handled here: https://github.com/edgi-govdata-archiving/web-monitoring-db/blob/a22ca00993fb1dcc7ce6eaddfb3dbdef89cdaa32/app/jobs/import_versions_job.rb#L136-L139
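To make the second suggestion more concrete, here is a sketch of deriving `media_type` and `media_type_parameters` from a raw `Content-Type` header value. The function names are hypothetical, and the assumption that `media_type_parameters` should be the raw parameter string (for example `charset=utf-8`) has not been checked against the import job linked above.

```python
def split_content_type(header):
    """Split a raw Content-Type value into (media_type, parameters).

    Hypothetical helper. Either element is None when it is missing or blank.
    """
    if not header:
        return None, None
    # Everything before the first ";" is the media type; the rest is parameters.
    media_type, _, parameters = header.partition(";")
    media_type = media_type.strip().lower() or None
    parameters = parameters.strip() or None
    return media_type, parameters


def media_fields(header):
    """Build the two top-level import fields from a Content-Type value."""
    media_type, parameters = split_content_type(header)
    return {"media_type": media_type, "media_type_parameters": parameters}
```

An empty or missing header yields `None` for both fields, which also avoids the validation error shown earlier in this issue.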