unitedstates/congress

Exception: Unmatched bulk data file URL


Good afternoon.

I have run into an error while running:

./run govinfo --bulkdata=BILLSTATUS

After an hour or so I receive this error:

Downloading: https://www.govinfo.gov/sitemap/bulkdata/BILLSTATUS/115hr/sitemap.xml
Traceback (most recent call last):

  File "./run", line 71, in <module>
    task_mod.run(options)

  File "/home/ubuntu/congress3/congress/tasks/govinfo.py", line 74, in run
    update_sitemap(BULKDATA_SITEMAPINDEX_PATTERN.format(collection=collection), None, [], options)

  File "/home/ubuntu/congress3/congress/tasks/govinfo.py", line 102, in update_sitemap
    return update_sitemap2(url, current_lastmod, how_we_got_here, options, lastmod_cache, cache_file)

  File "/home/ubuntu/congress3/congress/tasks/govinfo.py", line 155, in update_sitemap2
    sitemap_results = update_sitemap(url, lastmod, how_we_got_here, options)

  File "/home/ubuntu/congress3/congress/tasks/govinfo.py", line 102, in update_sitemap
    return update_sitemap2(url, current_lastmod, how_we_got_here, options, lastmod_cache, cache_file)

  File "/home/ubuntu/congress3/congress/tasks/govinfo.py", line 184, in update_sitemap2
    raise Exception("Unmatched bulk data file URL (%s) at %s." % (url, "->".join(how_we_got_here)))

Exception: Unmatched bulk data file URL (https://govinfo.gov/bulkdata/BILLSTATUS/115/hr/BILLSTATUS-115hr1.xml) at https://www.govinfo.gov/sitemap/bulkdata/BILLSTATUS/sitemapindex.xml->https://www.govinfo.gov/sitemap/bulkdata/BILLSTATUS/115hr/sitemap.xml.

System is Ubuntu 16.04, 8 GB RAM, Python 2.7.12.

Those particular XML files seem to exist, so I'm not sure what the issue is.

Thank you.

I've been getting that too. @usgpo has had some issues with sitemaps in the last week (usgpo/bulk-data#37) and it looks like this one got restored but URLs changed (cc @jonquandt at GPO). I'm not sure if it was an intentional change or not at GPO. Our sitemap reader could be less picky as well about the URL it sees... I'll post a fix in a moment.

Oh, and the issue is that we're expecting the URL in the sitemap to have "www." before govinfo.gov, and it disappeared in this file.
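To illustrate what "less picky" could mean here, below is a minimal sketch of a URL match that treats the "www." prefix as optional. It is not the actual code or fix in govinfo.py: the pattern, the `BULKDATA_FILE_PATTERN` name, and the `parse_bulkdata_url` helper are hypothetical, and the path layout is assumed from the single URL in the error message above.

```python
import re

# Hypothetical pattern (not the project's real regex). The "(?:www\.)?" group
# makes the "www." prefix optional, so both host spellings are accepted.
# The file name is required to repeat the collection, congress and bill type
# that appear in the directory part of the path.
BULKDATA_FILE_PATTERN = re.compile(
    r"^https://(?:www\.)?govinfo\.gov/bulkdata/"
    r"(?P<collection>[A-Z]+)/(?P<congress>\d+)/(?P<bill_type>[a-z]+)/"
    r"(?P=collection)-(?P=congress)(?P=bill_type)(?P<number>\d+)\.xml$"
)


def parse_bulkdata_url(url):
    """Return the named parts of a bulk data file URL, or raise ValueError."""
    match = BULKDATA_FILE_PATTERN.match(url)
    if match is None:
        raise ValueError("Unmatched bulk data file URL (%s)" % url)
    return match.groupdict()


if __name__ == "__main__":
    # The URL from the traceback (no "www.") and its "www." variant both parse.
    for host in ("govinfo.gov", "www.govinfo.gov"):
        url = "https://%s/bulkdata/BILLSTATUS/115/hr/BILLSTATUS-115hr1.xml" % host
        print(parse_bulkdata_url(url))
```

Accepting both hosts keeps a scraper running when the sitemap changes in this way, at the cost of no longer noticing such upstream changes; whether that trade-off is wanted is a project decision.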

@JoshData -- thanks for the heads up. I see the issue and we'll get it sorted out.

@JoshData and @jacobcrp -- we've updated that sitemap to include the www.

Please let me know if that doesn't resolve that issue and we can look at it further.

All good now! 👍