Exception: Unmatched bulk data file URL
Good afternoon.
I ran into an error while running:
./run govinfo --bulkdata=BILLSTATUS
After an hour or so I receive this error:
Downloading: https://www.govinfo.gov/sitemap/bulkdata/BILLSTATUS/115hr/sitemap.xml
Traceback (most recent call last):
  File "./run", line 71, in <module>
    task_mod.run(options)
  File "/home/ubuntu/congress3/congress/tasks/govinfo.py", line 74, in run
    update_sitemap(BULKDATA_SITEMAPINDEX_PATTERN.format(collection=collection), None, [], options)
  File "/home/ubuntu/congress3/congress/tasks/govinfo.py", line 102, in update_sitemap
    return update_sitemap2(url, current_lastmod, how_we_got_here, options, lastmod_cache, cache_file)
  File "/home/ubuntu/congress3/congress/tasks/govinfo.py", line 155, in update_sitemap2
    sitemap_results = update_sitemap(url, lastmod, how_we_got_here, options)
  File "/home/ubuntu/congress3/congress/tasks/govinfo.py", line 102, in update_sitemap
    return update_sitemap2(url, current_lastmod, how_we_got_here, options, lastmod_cache, cache_file)
  File "/home/ubuntu/congress3/congress/tasks/govinfo.py", line 184, in update_sitemap2
    raise Exception("Unmatched bulk data file URL (%s) at %s." % (url, "->".join(how_we_got_here)))
Exception: Unmatched bulk data file URL (https://govinfo.gov/bulkdata/BILLSTATUS/115/hr/BILLSTATUS-115hr1.xml) at https://www.govinfo.gov/sitemap/bulkdata/BILLSTATUS/sitemapindex.xml->https://www.govinfo.gov/sitemap/bulkdata/BILLSTATUS/115hr/sitemap.xml.
System is Ubuntu 16.04, 8 GB RAM, Python 2.7.12.
It seems those particular XML files exist, so I'm not sure what the issue is.
Thank you.
I've been getting that too. @usgpo has had some issues with sitemaps in the last week (usgpo/bulk-data#37) and it looks like this one got restored but URLs changed (cc @jonquandt at GPO). I'm not sure if it was an intentional change or not at GPO. Our sitemap reader could be less picky as well about the URL it sees... I'll post a fix in a moment.
Oh, and the issue is that we're expecting the URL in the sitemap to have "www." before govinfo.gov, and it disappeared in this file.
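For anyone hitting this in the meantime, here is a minimal sketch of what a less picky URL check could look like. It is not the actual code in govinfo.py: the names BULKDATA_FILE_RE and match_bulkdata_url and the named groups are hypothetical, and the path layout is only inferred from the URL in the traceback above. The one idea it illustrates is making the "www." prefix optional in the pattern.

```python
import re

# Hypothetical pattern (not the project's real one). The "(?:www\.)?" group
# makes the "www." prefix optional, so both host forms are accepted.
# The path layout is assumed from the URL shown in the traceback.
BULKDATA_FILE_RE = re.compile(
    r"^https?://(?:www\.)?govinfo\.gov/bulkdata/"
    r"(?P<collection>[A-Z]+)/(?P<congress>\d+)/(?P<bill_type>[a-z]+)/"
    r"(?P<filename>[^/]+\.xml)$"
)


def match_bulkdata_url(url):
    """Return the parsed parts of a bulk data file URL, or raise ValueError."""
    m = BULKDATA_FILE_RE.match(url)
    if m is None:
        raise ValueError("Unmatched bulk data file URL (%s)" % url)
    return m.groupdict()


# Both forms of the host parse to the same parts.
with_www = match_bulkdata_url(
    "https://www.govinfo.gov/bulkdata/BILLSTATUS/115/hr/BILLSTATUS-115hr1.xml")
without_www = match_bulkdata_url(
    "https://govinfo.gov/bulkdata/BILLSTATUS/115/hr/BILLSTATUS-115hr1.xml")
```

Only the standard re module is used, so the sketch should run on Python 2.7 as well as Python 3. URLs on any other host still raise, so a genuinely unexpected URL is not silently accepted.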
@JoshData -- thanks for the heads up. I see the issue and we'll get it sorted out.
All good now! 👍