Extract USLM to database tables
Hello! I am working to extract data from several collections from this location: https://www.govinfo.gov/app/collection/comps/w to database tables. I'm working to store key content for each law in one table as follows:
section identifier
section – subsection identifier
section – subsection – paragraph identifier
section – subsection – paragraph – subparagraph identifier
section – subsection – paragraph – subparagraph – clause identifier
section num
section – subsection num
section – subsection – paragraph num
section – subsection – paragraph – subparagraph num
section – subsection – paragraph – subparagraph – clause num
etc etc.
I am using OpenRefine to extract the data to CSV, then loading it into the database. OpenRefine does an OK job, but it a) does not seem able to get all of the content and b) does not deal well with large files.
I wonder if I am reinventing the wheel here, and if the USLM team has any ideas on best practices for extracting content from the XML.
thank you, and thanks for the good work!
-Joel
Hi Joel,
Files are also available from the GovInfo Bulk Data Repository at https://www.govinfo.gov/bulkdata/COMPS.
Also passing your questions to others on the USLM team.
Thanks,
Lisa
Hi Joel,
Thank you for your question.
Because XML has a fundamentally different architecture from relational databases (hierarchical versus tabular), mapping between the two is never easy. Relational database vendors have tackled this problem through XML-specific database features (e.g., "Oracle XML DB").
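To make the hierarchical-versus-tabular point concrete, here is a minimal sketch of one common way to flatten a tree into rows: one row per level element, with a pointer to its parent. It uses only Python's standard library. The sample XML, its identifier values, the table name `provision`, and the helpers `local` and `flatten` are all hypothetical; the element and attribute names (`section`, `subsection`, `paragraph`, `subparagraph`, `clause`, `num`, `identifier`) are taken from the list in the question and should be checked against the actual files.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample; real USLM files are namespaced and far larger.
SAMPLE = """<section identifier="/us/pl/1/s1"><num>1.</num>
  <subsection identifier="/us/pl/1/s1/a"><num>(a)</num>
    <paragraph identifier="/us/pl/1/s1/a/1"><num>(1)</num></paragraph>
  </subsection>
</section>"""

# Level names assumed from the question's column list.
LEVELS = {"section", "subsection", "paragraph", "subparagraph", "clause"}


def local(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def flatten(elem, parent_id=None):
    """Yield (level, identifier, num, parent_identifier) for each level element."""
    name = local(elem.tag)
    current = parent_id
    if name in LEVELS:
        ident = elem.get("identifier")
        # The first direct <num> child, if any, holds the displayed number.
        num = next((c.text for c in elem if local(c.tag) == "num"), None)
        yield (name, ident, num, parent_id)
        current = ident
    for child in elem:
        yield from flatten(child, current)


rows = list(flatten(ET.fromstring(SAMPLE)))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE provision (level TEXT, identifier TEXT, num TEXT, parent TEXT)"
)
conn.executemany("INSERT INTO provision VALUES (?, ?, ?, ?)", rows)
```

Storing the parent identifier keeps the hierarchy recoverable with a self-join, which may be easier to maintain than one column per nesting level.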
Regarding your specific question of how to extract the content of a given XML element (e.g. a given subparagraph), it is unclear to us whether OpenRefine will meet your needs. You might try the parseXml() function on the whole document, then select() the portion you need, but it may not perform as well as you need.
A 'best practice' solution would be to adopt XML tool(s) that can use XQuery to extract information from the XML content set (https://en.wikipedia.org/wiki/XQuery). Tools designed for the XML data model are better suited to reliably extracting data from XML files of any size than tools that were not designed for XML.
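If an XQuery processor is not an option, a streaming parser is one possible stand-in for the large-file problem. The sketch below is not XQuery; it uses `iterparse` from Python's standard library to handle one top-level section at a time and release it afterwards. The sample document, its placeholder namespace, and the names `local` and `stream_rows` are hypothetical, and the element and attribute names are assumed from the question rather than verified against the schema.

```python
import io
import xml.etree.ElementTree as ET

# Level names assumed from the question's column list.
LEVELS = {"section", "subsection", "paragraph", "subparagraph", "clause"}


def local(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def stream_rows(source):
    """Yield (level, identifier, num) rows, one outermost section at a time.

    `source` may be a file path or a binary file object.
    """
    open_sections = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        is_section = local(elem.tag) == "section"
        if event == "start":
            if is_section:
                open_sections += 1
            continue
        if not is_section:
            continue
        open_sections -= 1
        if open_sections:
            continue  # nested section: emitted along with its outermost ancestor
        for node in elem.iter():
            name = local(node.tag)
            if name in LEVELS:
                num = next((c.text for c in node if local(c.tag) == "num"), None)
                yield (name, node.get("identifier"), num)
        elem.clear()  # drop the finished section's children to limit memory use


# Hypothetical sample with a placeholder namespace.
SAMPLE = b"""<lawDoc xmlns="http://example.invalid/uslm"><main>
<section identifier="/s1"><num>1.</num>
  <subsection identifier="/s1/a"><num>(a)</num></subsection>
</section>
<section identifier="/s2"><num>2.</num></section>
</main></lawDoc>"""

rows = list(stream_rows(io.BytesIO(SAMPLE)))
```

The rows can be passed to a database driver's batch insert as they are produced, so the full document never has to be held in memory or exported to CSV first.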