HTTPArchive/data-pipeline

In future trim down custom metrics from `payload`?

tunetheweb opened this issue · 4 comments

As part of the effort to reduce the payload size by deduplicating data, we remove the _custom field (amongst others) when saving this data to all tables:

https://github.com/HTTPArchive/wptagent/blob/e4546673d3b658022afb3885885e696290da53c5/HTTPArchive/httparchive.py#L425-L427

    # Remove the fields that are parsed out into separate columns
    page.pop("_parsed_css", None)
    page.pop("_custom", None)
    page.pop("_lighthouse", None)
    ...

However, that is only a list of the custom metrics:

    "_custom": [
        "00_reset",
        "Colordepth",
        "Dpi",
        ...
        "usertiming",
        "valid-head",
        "well-known",
        "wpt_bodies"
    ],
    "_00_reset": null,
    "_Colordepth": 24,
    ...etc

The weightier parts are the actual custom metrics beneath this (_00_reset, _Colordepth, etc.), some of which are quite large.

So we should enhance this to remove those too to save a lot of weight.
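A minimal sketch of what that enhancement could look like, assuming `page` is the plain dict handled in httparchive.py and that every name listed in `_custom` has a matching key with a leading underscore (as in the sample above). The helper name `strip_custom_metrics` is hypothetical and not part of the existing code. Note that the `_custom` list has to be read before it is popped.

```python
def strip_custom_metrics(page: dict) -> dict:
    """Remove `_custom` and every `_<name>` metric it lists from `page`.

    Returns the removed values keyed by metric name (no leading underscore).
    Hypothetical helper, not the code that exists in httparchive.py.
    """
    removed = {}
    # `_custom` holds metric names without the leading underscore.
    for name in page.get("_custom") or []:
        key = f"_{name}"
        if key in page:
            removed[name] = page.pop(key)
    # Finally drop the list of names itself.
    page.pop("_custom", None)
    return removed


# Illustrative data only, modelled on the sample above.
page = {
    "_custom": ["00_reset", "Colordepth"],
    "_00_reset": None,
    "_Colordepth": 24,
    "_URL": "https://example.com/",
}
removed = strip_custom_metrics(page)
```

Returning the removed values rather than discarding them keeps the option open to store them elsewhere.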

However, for now, having them in there is useful for the legacy tables (since there is no legacy custom metrics table), so leave them for now. Filing this issue so we don't forget when we move off of legacy.

Code has been added but commented out. Just need to uncomment that block when we're ready to remove them.

FWIW, it's probably a fairly substantial size. The rendered_html metric in particular (as well as the CSS ones) can be multiple megabytes.

Yeah exactly!

We could still inject them into the payload when populating the pages table for legacy. Or just live without them as part of forcing people over?
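For the first option, a rough sketch of re-injecting the metrics for the legacy table only, assuming the custom metrics are available as a dict keyed by metric name and the trimmed payload is a plain dict. The function name `inject_custom_metrics` and both argument shapes are assumptions for illustration; the real legacy population step may look quite different.

```python
def inject_custom_metrics(payload: dict, custom_metrics: dict) -> dict:
    """Return a copy of `payload` with custom metrics restored for legacy use.

    Hypothetical helper: rebuilds the `_custom` name list and the
    `_<name>` keys from a separate dict of custom metric values.
    """
    legacy = dict(payload)  # shallow copy, so the trimmed payload is untouched
    legacy["_custom"] = list(custom_metrics)
    for name, value in custom_metrics.items():
        legacy[f"_{name}"] = value
    return legacy


# Illustrative data only.
trimmed = {"_URL": "https://example.com/"}
metrics = {"00_reset": None, "Colordepth": 24}
legacy_payload = inject_custom_metrics(trimmed, metrics)
```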

I’ve updated all the httparchive queries not to look at custom metrics in the legacy pages table, so we're all good to go from our end. The Web Almanac queries would need to be migrated, but I'm hoping the analysts this year will do a lot of those.

As we'll reprocess pages in HTTPArchive/dataform#8, let's include this cleanup there too.