NationalMuseumAustralia/Collection-API-ETL

Split source XML files to reduce XML-to-SPARQL ETL memory consumption


  • Split each source file into a folder of individual record files, using a streaming file splitter.
  • Refactor the XML-to-SPARQL pipeline to load the record files from these folders one at a time and pass each one to the RDF conversion XSLT.
  • Pass the record's type to the conversion XSLT as a parameter (replacing the file type recognition code in the XSLT).
  • Replace the stylesheet which marks some Piction images as preferred with an equivalent SPARQL update query.
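
To make the first step concrete, here is a minimal sketch of a streaming splitter in Python. It is illustrative only, not the project's actual implementation: the function name `split_records`, the `record_type` argument, the numbered file names, and the `<out_root>/<record_type>/` folder layout are all hypothetical, and it assumes every record is a direct child of the source file's root element.

```python
import os
import xml.etree.ElementTree as ET


def split_records(source_path, out_root, record_type):
    """Stream one source file and write each top-level record to its own file.

    Files go to <out_root>/<record_type>/ (a hypothetical layout) and the
    list of written paths is returned.
    """
    out_dir = os.path.join(out_root, record_type)
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    context = ET.iterparse(source_path, events=("start", "end"))
    _, root = next(context)  # "start" event for the document element
    depth = 0  # nesting depth below the document element
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth == 0:  # a direct child of the root has just closed
            elem.tail = None  # drop whitespace between records
            path = os.path.join(out_dir, f"{len(paths) + 1:06d}.xml")
            ET.ElementTree(elem).write(path, encoding="utf-8", xml_declaration=True)
            paths.append(path)
            root.clear()  # release the record so memory use stays flat
    return paths
```

Because only one record is held in memory at a time, peak memory should depend on the largest record rather than on the size of the source file. Since the folder name carries the record type, a later pipeline step could read it from there and hand it to the conversion XSLT as a parameter.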

Replace the Piction stylesheet with a SPARQL query first, since that part doesn't depend on the other (XML splitting) changes.

Waiting for go-ahead