edgi-govdata-archiving/web-monitoring-processing

Import records should not have an empty string for `mime_type`


Import records with an empty string for `source_metadata.mime_type` fail to import, even though we do occasionally create them. For example, a recent import had the following record:

{
    "page_url": "https://www.niehs.nih.gov/sitemap/",
    "uuid": null,
    "capture_time": "2020-08-07T09:15:37Z",
    "uri": "http://web.archive.org/web/20200807091537id_/https://www.niehs.nih.gov/sitemap/",
    "hash": "db278b306b8e6380a2d27ca290b5b25f5e1969da9d9a472aabc26ddf07a6bbe0",
    "source_type": "internet_archive",
    "title": "404 Not Found",
    "source_metadata": {
        "mime_type": "",
        "encoding": null,
        "headers": {
            "Access-Control-Allow-Origin": "*",
            "Strict-Transport-Security": "max-age=31536000;includeSubDomains"
        },
        "view_url": "http://web.archive.org/web/20200807091537/https://www.niehs.nih.gov/sitemap/",
        "error_code": 404
    },
    "status": "404",
    "page_maintainers": [],
    "page_tags": []
}

Which produced the error:

Row 72: Media type must be a media type, like `text/plain`, and *not* include parameters, like `; charset=utf-8`

Instead, we should set `mime_type` to `None` when the `content-type` header is missing or empty. The relevant code is probably in `cli.py`:

metadata = {
    'mime_type': memento.headers.get('content-type', '').split(';', 1)[0],
    'encoding': memento.encoding,
    'headers': original_headers,
    'view_url': cdx_record.view_url
}
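One possible shape for the fix is sketched below. The helper name `normalize_mime_type` is hypothetical (it is not an existing function in `cli.py`), and the sketch assumes the header value is either a string or missing. It returns `None` instead of an empty string when no usable media type is present.

```python
def normalize_mime_type(content_type_header):
    """Return the bare media type from a Content-Type header value.

    Hypothetical helper for illustration. Returns None when the header
    is missing, empty, or contains only parameters.
    """
    if not content_type_header:
        return None
    # Drop any parameters such as "; charset=utf-8".
    media_type = content_type_header.split(';', 1)[0].strip()
    # An empty result becomes None rather than ''.
    return media_type or None
```

In the snippet above, the `'mime_type'` entry could then become something like `normalize_mime_type(memento.headers.get('content-type'))`.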

Relatedly, we should probably set media_type and media_type_parameters instead of source_metadata.mime_type and source_metadata.encoding on the import records. See how imports are handled here: https://github.com/edgi-govdata-archiving/web-monitoring-db/blob/a22ca00993fb1dcc7ce6eaddfb3dbdef89cdaa32/app/jobs/import_versions_job.rb#L136-L139
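As a rough sketch of that idea, the header could be split into the two values in one step. The function name `split_content_type` is hypothetical, and the exact format the importer expects for `media_type_parameters` is an assumption here (a plain string such as `charset=utf-8`); check the linked import job before relying on it.

```python
def split_content_type(content_type_header):
    """Split a Content-Type header into (media_type, media_type_parameters).

    Hypothetical helper for illustration. Either element is None when
    that part of the header is missing or empty.
    """
    if not content_type_header:
        return None, None
    # partition() splits on the first ';' only, so multiple
    # parameters stay together in the second value.
    media_type, _, parameters = content_type_header.partition(';')
    return (media_type.strip() or None), (parameters.strip() or None)
```

With this, `split_content_type('text/html; charset=utf-8')` yields `('text/html', 'charset=utf-8')`, and an empty header yields `(None, None)`.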

This was fixed in #621.