Extract USLM to database tables

Question

Opened this issue a year ago · 4 comments

Hello! I am working to extract data from several collections at https://www.govinfo.gov/app/collection/comps/w into database tables. I want to store the key content for each law in one table, as follows:

section identifier
section – subsection identifier
section – subsection – paragraph identifier
section – subsection – paragraph – subparagraph identifier
section – subsection – paragraph – subparagraph – clause identifier

section num
section – subsection num
section – subsection – paragraph num
section – subsection – paragraph – subparagraph num
section – subsection – paragraph – subparagraph – clause num

etc etc.
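
As an illustration of the layout listed above, here is a minimal Python sketch that flattens the hierarchy into one row per element, carrying the identifier and num of every ancestor level. It assumes each level is an element named `section`, `subsection`, `paragraph`, `subparagraph`, or `clause` with an `identifier` attribute and a `num` child holding a `value` attribute. The sample document, its root element, its namespace, and its identifiers are hypothetical placeholders, not taken from a real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample. The namespace is a placeholder;
# real files declare their own, which local() below simply ignores.
SAMPLE = """<doc xmlns="urn:example:uslm">
  <section identifier="/s1">
    <num value="1">SEC. 1.</num>
    <subsection identifier="/s1/a">
      <num value="a">(a)</num>
      <paragraph identifier="/s1/a/1">
        <num value="1">(1)</num>
        <content>Example text.</content>
      </paragraph>
    </subsection>
  </section>
</doc>"""

LEVELS = ["section", "subsection", "paragraph", "subparagraph", "clause"]


def local(tag):
    """Return the tag name without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def flatten(elem, path=()):
    """Yield one dict per hierarchical element, including ancestor columns."""
    for child in elem:
        name = local(child.tag)
        if name not in LEVELS:
            # Wrapper or content element: keep descending with the same path.
            yield from flatten(child, path)
            continue
        num = next((c for c in child if local(c.tag) == "num"), None)
        value = num.get("value") if num is not None else None
        new_path = path + ((name, child.get("identifier"), value),)
        row = {}
        for level, identifier, level_num in new_path:
            row[f"{level}_identifier"] = identifier
            row[f"{level}_num"] = level_num
        yield row
        yield from flatten(child, new_path)


rows = list(flatten(ET.fromstring(SAMPLE)))
for row in rows:
    print(row)
```

Each row is a plain dict, so it can be passed to `csv.DictWriter` or a database insert. Levels missing from a given row can be left as NULL.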

I am using OpenRefine to extract the data to CSV and then load it into the database. OpenRefine does an OK job, but it a) does not seem able to get all the content and b) does not deal well with large files.

I wonder if I am reinventing the wheel here, and whether the USLM team has any ideas on best practices for extracting content from the XML.

thank you, and thanks for the good work!

-Joel

Answer 1 · 2023-09-25T14:26:31.000Z

Hi Joel,

Files are also available from the GovInfo Bulk Data Repository at https://www.govinfo.gov/bulkdata/COMPS.

Also passing your questions to others on the USLM team.

Thanks,
Lisa

Answer 2 · 2023-09-26T00:19:40.000Z

Hi Joel,

Thank you for your question.

Because XML has a fundamentally different architecture from relational databases (hierarchical versus tabular), mapping between the two is never easy. Relational database companies have tackled this problem through XML-specific databases (e.g., "Oracle XML DB").

Regarding your specific question of how to extract the content of a given XML element (e.g. a given subparagraph) - it is unclear to us whether OpenRefine will meet your needs. You might try the parseXml() function on the whole document, then select() the portion you need. But it may not perform as well as you need.

A 'best practice' solution would be to adopt XML tool(s) that are able to use XQuery to extract information from the XML content set (https://en.wikipedia.org/wiki/XQuery). Tools designed for the XML data model are better suited to reliably extracting data from XML files of any size than tools that were not designed for XML.
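
To make the idea of querying by path concrete, here is a minimal sketch in Python. It is not XQuery: it uses the limited XPath subset built into the standard library's `xml.etree.ElementTree`, which is enough to select one subparagraph by its `identifier` attribute. The sample document, the placeholder namespace `urn:example:uslm`, the identifiers, and the helper `content_of` are all hypothetical; a real file declares its own namespace, which would need to be substituted in `NS`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; element names follow the hierarchy discussed above.
SAMPLE = """<doc xmlns="urn:example:uslm">
  <section identifier="/s1">
    <num value="1">SEC. 1.</num>
    <subsection identifier="/s1/a">
      <num value="a">(a)</num>
      <paragraph identifier="/s1/a/1">
        <num value="1">(1)</num>
        <subparagraph identifier="/s1/a/1/A">
          <num value="A">(A)</num>
          <content>First item.</content>
        </subparagraph>
        <subparagraph identifier="/s1/a/1/B">
          <num value="B">(B)</num>
          <content>Second item.</content>
        </subparagraph>
      </paragraph>
    </subsection>
  </section>
</doc>"""

# Prefix-to-namespace map used by find(); the URI is a placeholder.
NS = {"u": "urn:example:uslm"}


def content_of(root, level, identifier):
    """Return the whitespace-normalized <content> text of one element.

    `level` is an element name such as "subparagraph"; `identifier` is the
    value of its identifier attribute. Returns None if nothing matches.
    """
    hit = root.find(f".//u:{level}[@identifier='{identifier}']", NS)
    if hit is None:
        return None
    content = hit.find("u:content", NS)
    if content is None:
        return None
    # itertext() also collects text nested inside inline child elements.
    return " ".join("".join(content.itertext()).split())


root = ET.fromstring(SAMPLE)
text = content_of(root, "subparagraph", "/s1/a/1/B")
print(text)
```

For files too large to hold in memory, `xml.etree.ElementTree.iterparse` can process elements as they are read, at the cost of more bookkeeping. A full XQuery processor remains the more capable option for complex queries.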

Answer 3 · 2023-09-26T11:55:19.000Z

Thank you all so much for getting back to me! We are going to press on, then, knowing that we are at least not duplicating effort. Oracle XML DB is the next stop for sure after this proof of concept. Our ultimate goal is to have a searchable database of the full text of laws relevant to our mission (Army Corps of Engineers Civil Works) that we can link to our appropriations and expenditures. If anyone is interested in this type of work, please feel free to forward my contact info. Thank you again for the awesome work at GPO!

…

-joel

Answer 4 · 2023-11-16T16:14:10.000Z

USLM team - we built a good-but-not-perfect parser for USLM that creates database tables from the text fields in the USLM schema. The USLM schema documentation and the consistency of the XML made this possible. Happy to share it back with the community.
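
The parser mentioned above is not shown in this thread. Purely as an illustration of the general approach, and not a reconstruction of that parser, the following Python sketch loads each hierarchical element of a hypothetical sample into a SQLite table and runs a simple text search. The table name `provisions`, its columns, the sample XML, and the placeholder namespace are all assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample with a placeholder namespace.
SAMPLE = """<doc xmlns="urn:example:uslm">
  <section identifier="/s1">
    <num value="1">SEC. 1.</num>
    <heading>Short title</heading>
    <subsection identifier="/s1/a">
      <num value="a">(a)</num>
      <content>This Act may be cited as the Example Act.</content>
    </subsection>
  </section>
</doc>"""

LEVELS = {"section", "subsection", "paragraph", "subparagraph", "clause"}


def local(tag):
    """Return the tag name without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def own_text(elem):
    """Text of direct children, skipping <num> and nested hierarchy levels."""
    parts = []
    for child in elem:
        name = local(child.tag)
        if name in LEVELS or name == "num":
            continue
        parts.append(" ".join("".join(child.itertext()).split()))
    return " ".join(p for p in parts if p)


def extract(root):
    """Yield (identifier, level, num, text) for every hierarchical element."""
    for elem in root.iter():
        level = local(elem.tag)
        if level not in LEVELS:
            continue
        num = next((c for c in elem if local(c.tag) == "num"), None)
        value = num.get("value") if num is not None else None
        yield (elem.get("identifier"), level, value, own_text(elem))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE provisions ("
    "identifier TEXT PRIMARY KEY, level TEXT, num TEXT, text TEXT)"
)
conn.executemany(
    "INSERT INTO provisions VALUES (?, ?, ?, ?)",
    extract(ET.fromstring(SAMPLE)),
)
conn.commit()

# Simple substring search over the stored text.
hits = conn.execute(
    "SELECT identifier FROM provisions WHERE text LIKE ?", ("%Example Act%",)
).fetchall()
print(hits)
```

Storing only each element's own text, rather than the text of all its descendants, avoids repeating the same words at every level. The full text of a section can still be rebuilt by selecting rows whose identifier starts with the section's identifier.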