usgpo/bill-status

Incomplete XML for H.R. 5376

LonelySpaceman opened this issue · 2 comments

Hey y'all! This database is fantastic, I'm so grateful that all this information is in the same place.

The XML link for H.R. 5376 was taking forever to load (probably due to its size), so I threw together a quick Python script to force the link's contents into a .html file and .xml file:

import requests
testHtml = requests.get('https://www.govinfo.gov/bulkdata/BILLS/117/1/hr/BILLS-117hr5376rh.xml')

fileName = 'Holder.html'  # <--- replace w/ 'Holder.xml' for .xml file
with open(fileName, 'w', encoding='ISO-8859-1') as fileObject:
    fileObject.write(testHtml.text)

The resulting Holder.xml file refuses to open, saying "Problems with XSL transform 'billres.xsl' prevent it from being applied to this XML file." I can use my browser to open Holder.html, but when I do it only contains the bill header and sections subsequent to section 130002. In addition, the formatting seems to be completely messed up: lines that are supposed to be separate bleed into each other, and indentations are completely missing. Am I doing something wrong here?

aih commented

Inside these XML files is a link to an XSLT transform ('billres.xsl') that is meant to convert the XML to HTML in your browser. Unless you've been working with XML for a while, that link will do things you don't expect.
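To make that a bit more concrete, here is a rough sketch. The link to the transform is an `xml-stylesheet` processing instruction near the top of the file. Assuming that instruction is what trips up a locally saved copy (since 'billres.xsl' isn't sitting next to it on disk), removing it should let a browser show the raw XML tree instead. The function name `strip_stylesheet_pi` and the tiny sample document are made up for illustration; they are not taken from the real bill file.

```python
import re

# Matches an <?xml-stylesheet ... ?> processing instruction plus trailing whitespace.
_STYLESHEET_PI = re.compile(r"<\?xml-stylesheet\b.*?\?>\s*", re.DOTALL)


def strip_stylesheet_pi(xml_text: str) -> str:
    """Return xml_text with any xml-stylesheet processing instructions removed."""
    return _STYLESHEET_PI.sub("", xml_text)


# Hypothetical miniature stand-in for a bill file (not the real structure).
sample = (
    '<?xml version="1.0"?>\n'
    '<?xml-stylesheet type="text/xsl" href="billres.xsl"?>\n'
    '<bill><legis-body/></bill>'
)

cleaned = strip_stylesheet_pi(sample)
print(cleaned)
```

The XML declaration and the document itself are left alone; only the stylesheet reference is dropped.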

If your goal is to open just one of the files, as XML, you should go to:

view-source:https://www.govinfo.gov/bulkdata/BILLS/117/1/hr/BILLS-117hr5376rh.xml

This will take about 10 seconds to load on a reasonable internet connection. The main content is all on one (very long) line. In Chrome, I get the option to 'Line Wrap', which makes it readable.

Once this is loaded in your browser, you can 'Save Page As (cmd+S)' from the File menu.

Alternatively, your code works fine for downloading the XML with small changes. It will not download the HTML, though, since that is generated dynamically in the browser:

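>>> import requests
>>> testXML = requests.get('https://www.govinfo.gov/bulkdata/BILLS/117/1/hr/BILLS-117hr5376rh.xml')
>>> testXML.raise_for_status()  # stop here if the download failed
>>> with open('myfile.xml', 'wb') as f:  # write the raw bytes so the file's own encoding is kept
...     f.write(testXML.content)
...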

You can open myfile.xml in an XML editor (e.g. Oxygen).
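If you'd rather check from Python that the download is complete, a minimal sketch is to parse the file and count its elements: a truncated or mangled file should fail to parse. This assumes the bill parses cleanly with the standard library's ElementTree, and that sections are marked up as `section` elements, which I haven't verified against this particular file. `summarize_xml` and the sample bytes are hypothetical names and data for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_xml(xml_bytes: bytes) -> Counter:
    """Parse xml_bytes and count how often each element tag occurs.

    Raises xml.etree.ElementTree.ParseError if the document is not well-formed,
    e.g. because the download was cut short.
    """
    root = ET.fromstring(xml_bytes)
    return Counter(element.tag for element in root.iter())


# Hypothetical miniature document standing in for the downloaded bill.
sample = (
    b"<bill><legis-body>"
    b"<section><enum>1.</enum></section>"
    b"<section><enum>2.</enum></section>"
    b"</legis-body></bill>"
)

counts = summarize_xml(sample)
print(counts["section"])
```

For the real file, something like `summarize_xml(open('myfile.xml', 'rb').read())` should work, with the same caveats.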

If you want to see the HTML, which includes all of the styling, you would need to load the page directly (https://www.govinfo.gov/bulkdata/BILLS/117/1/hr/BILLS-117hr5376rh.xml) and wait, possibly 5-10 minutes. As it says on congress.gov:

https://www.congress.gov/bill/117th-congress/house-bill/5376/text

This text has been loaded in plain text format due to the large size of the XML/HTML file. Loading the XML/HTML in a new window (4MB) may take several minutes or possibly cause your browser to become unresponsive.

Thank you, @aih!
@LonelySpaceman, I am closing the issue but I can reopen it if you need additional assistance.