usgpo/uslm

Extract USLM to database tables

Opened this issue · 4 comments

Hello! I am working to extract data from several collections from this location: https://www.govinfo.gov/app/collection/comps/w to database tables. I'm working to store key content for each law in one table as follows:

section identifier
section – subsection identifier
section – subsection – paragraph identifier
section – subsection – paragraph – subparagraph identifier
section – subsection – paragraph – subparagraph – clause identifier

section num
section – subsection num
section – subsection – paragraph num
section – subsection – paragraph – subparagraph num
section – subsection – paragraph – subparagraph – clause num

etc etc.

I am using OpenRefine to extract the data to CSV, then loading it into the database. OpenRefine does an OK job, but it a) does not seem able to get all the content and b) does not deal well with large files.

I wonder if I am reinventing the wheel here, and whether the USLM team has any suggestions on best practices for extracting content from the XML.

thank you, and thanks for the good work!

-Joel

Hi Joel,

Files are also available from the GovInfo Bulk Data Repository at https://www.govinfo.gov/bulkdata/COMPS.

Also passing your questions to others on the USLM team.

Thanks,
Lisa

Hi Joel,

Thank you for your question.

Because XML has a fundamentally different architecture from relational databases (hierarchical versus tabular), mapping between the two is never easy. Relational database vendors have tackled this problem through XML-specific databases (e.g., "Oracle XML DB").
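To make the hierarchical-versus-tabular mismatch concrete, here is a minimal Python sketch that walks a nested fragment and emits one flat row per level, carrying the ancestors' identifiers and nums along as columns. The sample XML is made up for illustration. The element names (`section`, `subsection`, `paragraph`, `subparagraph`, `clause`, `num`) and the `identifier` and `value` attributes are assumed to match the USLM schema and should be checked against the schema documentation. The helper names `local_name` and `flatten` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hierarchy levels to turn into columns (assumed USLM element names).
LEVELS = ["section", "subsection", "paragraph", "subparagraph", "clause"]

# Made-up sample fragment, not taken from a real compilation.
SAMPLE = """<section identifier="/s1">
  <num value="1">SEC. 1.</num>
  <subsection identifier="/s1/a">
    <num value="a">(a)</num>
    <paragraph identifier="/s1/a/1"><num value="1">(1)</num></paragraph>
    <paragraph identifier="/s1/a/2"><num value="2">(2)</num></paragraph>
  </subsection>
</section>"""


def local_name(tag):
    """Strip any '{namespace}' prefix so the code works with or without one."""
    return tag.rsplit("}", 1)[-1]


def flatten(elem, path=()):
    """Yield one dict per level element, including its ancestors' columns."""
    name = local_name(elem.tag)
    if name in LEVELS:
        # The first direct <num> child, if any, supplies the num value.
        num = next((c for c in elem if local_name(c.tag) == "num"), None)
        value = num.get("value") if num is not None else None
        path = path + ((name, elem.get("identifier"), value),)
        row = {}
        for level, ident, n in path:
            row[f"{level}_identifier"] = ident
            row[f"{level}_num"] = n
        yield row
    for child in elem:
        yield from flatten(child, path)


rows = list(flatten(ET.fromstring(SAMPLE)))
for row in rows:
    print(row)
```

Each row only has the columns for its own depth, so a CSV writer or database insert would need to fill the deeper columns with empty values.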

Regarding your specific question of how to extract the content of a given XML element (e.g., a given subparagraph): it is unclear to us whether OpenRefine will meet your needs. You might try the parseXml() function on the whole document, then select() the portion you need. But it may not perform as well as you need.

A 'best practice' solution would be to adopt XML tool(s) that can use XQuery to extract information from the XML content set (https://en.wikipedia.org/wiki/XQuery). Tools designed for the XML data model are better suited to reliably extracting data from XML files of any size than tools that were not designed for XML.
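If an XQuery processor is not at hand, a streaming XML parser is one way to get similar robustness on large files. The sketch below is not XQuery: it swaps in Python's standard-library `xml.etree.ElementTree.iterparse`, which reads the file incrementally. The sample document, its placeholder namespace, and the helper names `local_name` and `iter_sections` are assumptions for illustration. The `section` element and `identifier` attribute are assumed to match the USLM schema.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample with a placeholder namespace (not the real USLM namespace).
SAMPLE = b"""<doc xmlns="urn:example:placeholder">
  <section identifier="/s1">
    <num value="1">SEC. 1.</num>
    <content>First   text.</content>
  </section>
  <section identifier="/s2">
    <num value="2">SEC. 2.</num>
    <content>Second text.</content>
  </section>
</doc>"""


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_sections(source):
    """Yield (identifier, whitespace-normalized text) for each <section>.

    'end' events fire once an element is fully parsed, so its text is
    complete; clearing it afterwards keeps memory use low on big files.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == "section":
            text = " ".join("".join(elem.itertext()).split())
            yield elem.get("identifier"), text
            elem.clear()  # drop the element's children and text


sections = list(iter_sections(io.BytesIO(SAMPLE)))
for identifier, text in sections:
    print(identifier, text)
```

One limitation: if a section contains another section (for example inside quoted amendment text), the inner one is yielded and cleared first, so the outer section's text will be missing it. A real extractor would need to handle that case.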

USLM team - we built a good-but-not-perfect parser for USLM that creates database tables from text fields in the USLM schema. The USLM schema documentation and the consistency of the XML made this possible. Happy to share it back to the community.
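For readers who want a starting point for the database side, here is a minimal sketch using Python's built-in `sqlite3` module. It is not the parser mentioned above. The table name `provision`, its columns, and the sample rows are all hypothetical. It stores one row per provision with a pointer to its parent, rather than one column per hierarchy level.

```python
import sqlite3

# Hypothetical rows: (identifier, level, num, parent_identifier).
ROWS = [
    ("/s1", "section", "1", None),
    ("/s1/a", "subsection", "a", "/s1"),
    ("/s1/a/1", "paragraph", "1", "/s1/a"),
]


def load(conn, rows):
    """Create the table if needed and insert the rows in one transaction."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS provision (
               identifier TEXT PRIMARY KEY,
               level TEXT NOT NULL,
               num TEXT,
               parent_identifier TEXT REFERENCES provision(identifier)
           )"""
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO provision VALUES (?, ?, ?, ?)", rows
        )


conn = sqlite3.connect(":memory:")
load(conn, ROWS)

# Look up the direct children of section /s1.
children = conn.execute(
    "SELECT identifier FROM provision WHERE parent_identifier = ? "
    "ORDER BY identifier",
    ("/s1",),
).fetchall()
print(children)
```

The parent-pointer layout keeps the table the same shape no matter how deep the hierarchy goes. A wide table with one column per level can still be produced later with joins or a recursive query.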