GSA/site-scanning

Process new meta tag scans

Closed this issue · 3 comments

Note: gut-check the methodology for any tag that returned 0 results.

  • Share with people who contributed to the list building
  • Add the percentages for the tags that are currently scanned

And then make the calls on what to keep and how.

Stats:

  • 28,624 Target URLs
  • 15,133 are live
  • 1,006 tags were detected on the other 13,491 non-live Target URLs
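
As a quick sanity check on the stats above, here is a minimal sketch that re-derives the non-live count from the two totals. The variable names are just illustrative, and the live share is a derived figure, not something the scan itself reports.

```python
# Totals copied from the stats above.
target_urls = 28_624
live_urls = 15_133

# Non-live Target URLs are whatever is left after removing the live ones.
non_live_urls = target_urls - live_urls  # 13491, matching the stat above

# Share of Target URLs that are live, formatted like the tables in this issue.
live_share = f"{100 * live_urls / target_urls:.2f}%"

print(non_live_urls, live_share)
```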

Compare the lower table (the new tag scans) against the first table, which gives the numbers for the tags that are currently scanned:

Field                 Sites with a value (of 15,031)  %
title                 14466                           96.24%
description           4615                            30.70%
og_title              3854                            25.64%
og_description        3018                            20.08%
og_article_published  260                             1.73%
og_article_modified   683                             4.54%
canonical_link        4677                            31.12%
viewport_meta_tag     10929                           72.71%
main_element_present  5383                            35.81%

Legend for the table below: * = search.gov recommended; + = civichackingagency recommended; ^ = Open Graph; ~ = Dublin Core; # = schema.org

Tag                             Sites using it (of 15,133)  %
meta_keywords_content*          1733                        11.45%
meta_robots_content*+           2702                        17.86%
meta_article_section_content*^  0                           0.00%
meta_article_tag_content*^      3                           0.02%
og_image_final_url*+^           3198                        21.13%
dcterms_keywords_content*~      0                           0.00%
dc_subject_content*~            41                          0.27%
dcterms_subject_content*~       369                         2.44%
dcterms_audience_content*~      48                          0.32%
dc_type_content*~               41                          0.27%
dcterms_type_content*~          507                         3.35%
dc_date_content*~               4                           0.03%
dc_date_created_content*~       21                          0.14%
dcterms_created_content*~       297                         1.96%
og_locale_content+              1293                        8.54%
og_site_name_content+           2962                        19.57%
og_type_content+                3094                        20.45%
og_url_content+                 3483                        23.02%
og_image_alt_content+           690                         4.56%
revised_content                 11                          0.07%
last_modified_content           32                          0.21%
language_content                205                         1.35%
date_content                    16                          0.11%
subject_content                 64                          0.42%
owner_content                   26                          0.17%
pagename_content                0                           0.00%
dc_title_content~               158                         1.04%
og_site_name^                   29                          0.19%
item_type_content#              682                         4.51%
item_scope_content#             176                         1.16%
item_prop_content#              859                         5.68%
vocab_content#                  8                           0.05%
type_of_content#                969                         6.40%
property_content#               4979                        32.90%
context_content#                31                          0.20%
type_content#                   12271                       81.09%
html_lang_content               12181                       80.49%
href_lang_content               775                         5.12%
me_content                      0                           0.00%
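
For reference, the % column is just the count divided by the 15,133 live sites. The sketch below recomputes it for a handful of rows copied from the table and lists the tags with 0 results, which are the ones whose methodology the note at the top says to gut-check. The `percent` helper and the variable names are illustrative, not part of the scanner.

```python
LIVE_SITES = 15_133  # denominator from the table header

# A few (tag, count) rows copied from the table above.
counts = {
    "meta_keywords_content": 1733,
    "og_image_final_url": 3198,
    "meta_article_section_content": 0,
    "dcterms_keywords_content": 0,
    "pagename_content": 0,
    "me_content": 0,
    "html_lang_content": 12181,
}


def percent(count, total=LIVE_SITES):
    """Format count/total the same way as the % column (two decimals)."""
    return f"{100 * count / total:.2f}%"


percents = {tag: percent(n) for tag, n in counts.items()}

# Tags with no hits at all: candidates for a methodology check.
zero_hit_tags = sorted(tag for tag, n in counts.items() if n == 0)

for tag, pct in percents.items():
    print(f"{tag:32}{pct}")
print("0 results:", ", ".join(zero_hit_tags))
```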

Here's our proposal of what to keep, what to not keep, and why.

Keep:

  • <meta name='keywords' - Decent adoption and useful content indicators
  • <meta property="og:image" - Important component of OG implementation
  • <meta property="og:type" - Important component of OG implementation
  • <meta property="og:url" - Important component of OG implementation
  • <html lang= - Important data for multilingual content analysis
  • <link hreflang= - Important data for multilingual content analysis
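
To make the "Keep" list concrete, here is a rough sketch of pulling those six fields out of a page using only Python's standard-library `html.parser`. It is not the scanner's actual implementation; the class name, function name, and output keys are made up for illustration, and the sample page is invented.

```python
from html.parser import HTMLParser


class KeptTagParser(HTMLParser):
    """Collects the fields proposed under 'Keep' from one HTML page."""

    META_NAMES = {"keywords"}
    META_PROPERTIES = {"og:image", "og:type", "og:url"}

    def __init__(self):
        super().__init__()
        # hreflang can appear on many <link> elements, so collect a list.
        self.found = {"hreflang": []}

    def handle_starttag(self, tag, attrs):
        # attrs arrives as (name, value) pairs; value is None for bare attributes.
        attrs = {k: (v or "") for k, v in attrs}
        if tag == "html" and "lang" in attrs:
            self.found["html_lang"] = attrs["lang"]
        elif tag == "link" and "hreflang" in attrs:
            self.found["hreflang"].append(attrs["hreflang"])
        elif tag == "meta":
            name = attrs.get("name", "").lower()
            prop = attrs.get("property", "").lower()
            if name in self.META_NAMES:
                self.found[name] = attrs.get("content", "")
            elif prop in self.META_PROPERTIES:
                self.found[prop] = attrs.get("content", "")


def extract_kept_tags(html: str) -> dict:
    """Return whichever of the kept fields are present in the given HTML."""
    parser = KeptTagParser()
    parser.feed(html)
    return parser.found


# Invented sample page for demonstration only.
sample = """<html lang="en"><head>
<meta name="keywords" content="benefits, forms">
<meta property="og:type" content="website">
<link rel="alternate" hreflang="es" href="https://example.gov/es/">
</head><body></body></html>"""

result = extract_kept_tags(sample)
print(result)
```

Fields that are absent from the page (here `og:image` and `og:url`) simply do not appear in the result, which keeps "tag not present" distinct from "tag present but empty".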

Don't keep:

  • <meta name="robots" - The content isn't currently actionable, and if we wanted to pursue it, we should first do more with robots.txt
  • <meta name="article:section" - Low adoption by agencies
  • <meta name="article:tag" - Low adoption by agencies
  • <meta name="dcterms.keywords" - Low adoption by agencies
  • <meta name="dc.subject" - Low adoption by agencies
  • <meta name="dcterms.subject" - Low adoption by agencies
  • <meta name="dcterms.audience" - Low adoption by agencies
  • <meta name="dc.type" - Low adoption by agencies
  • <meta name="dcterms.type" - Low adoption by agencies
  • <meta name="dc.date" - Low adoption by agencies
  • <meta name="dc.date.created" - Low adoption by agencies
  • <meta name="dcterms.created" - Low adoption by agencies
  • <meta property="og:locale" - Decent adoption, but almost universally just an indicator of EN or EN_US. Only 4 records have any other value.
  • <meta property="og:site_name" - Good adoption, but it's not clear that it's central to OG implementation, and the information is likely to duplicate e.g. the page title.
  • <meta property="og:image:alt" - Low adoption by agencies, and it's not clear that it's an important component of accessibility
  • <meta name="revised" - Low adoption by agencies
  • <meta http-equiv="last-modified" - Low adoption by agencies
  • <meta name="language" - Low adoption, and almost universally just an indicator of EN or EN_US. Only 3 records have any other value.
  • <meta name="date" - Low adoption by agencies
  • <meta name="subject" - Low adoption by agencies
  • <meta name="owner" - Low adoption by agencies
  • <meta name="pagename" - Low adoption by agencies
  • <meta name="DC.title" - Low adoption by agencies
  • <meta name="og:site_name" - Low adoption by agencies
  • <link rel="me" - Low adoption by agencies
  • itemtype="" - Mainly scanned to determine schema.org adoption levels
  • itemscope="" - Mainly scanned to determine schema.org adoption levels
  • itemprop="" - Mainly scanned to determine schema.org adoption levels
  • vocab="" - Mainly scanned to determine schema.org adoption levels
  • typeof="" - Mainly scanned to determine schema.org adoption levels
  • property="" - Mainly scanned to determine schema.org adoption levels
  • context="" - Mainly scanned to determine schema.org adoption levels
  • type="" - Mainly scanned to determine schema.org adoption levels

This is done. I still need to add links to this data in the documentation before closing the issue, though.