EAD-to-LUX

EAD to JSON process. Just intended for a proof-of-concept stage of a project.

Primary language: XSLT

The current workflow takes all of the validated EAD3 source files (e.g. from https://github.com/YaleArchivesSpace/Archives-at-Yale-EAD3) and converts them to JSON using the EAD3_to_LUX.xsl transformation. That simplifies things quite a bit, since all of the files are already serialized, updated according to local standards, and validated before they ever reach the GitHub repository.

Since I wound up updating the ArchivesSpace EAD3 export process during this project, though, and since those updates will likely not be reflected in our GitHub repository until around the end of June 2020, I decided to re-export local copies of all of the EAD files and validate them again against the EAD3 schema. I can share those files if anyone is interested. The gist is that I used our normal 'resource-update-feed' ArchivesSpace API endpoint to export the files, then ran our normal transformation (e.g. something like java -jar saxon/saxon-he-10.0.jar -s:ead_brbl -o:ead-prepped -xsl:yale.aspace_v2_to_yale_ead3.xsl -threads:4), followed by the normal validation (e.g. java -jar jing/bin/jing.jar -t ead3.xsd ead-prepped/*.xml).
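The transform-and-validate steps above could be scripted along the following lines. This is only a rough sketch: the jar paths, directory names, schema, and stylesheet are copied from the example commands above, while the helper names (saxon_command, jing_command, run_pipeline) are hypothetical and not part of this repository. It also assumes that Jing exits with a non-zero status when a file fails validation.

```python
import glob
import subprocess


def saxon_command(source_dir, output_dir, stylesheet,
                  jar="saxon/saxon-he-10.0.jar", threads=4):
    """Build the Saxon command line for transforming a directory of EAD files."""
    return [
        "java", "-jar", jar,
        f"-s:{source_dir}",
        f"-o:{output_dir}",
        f"-xsl:{stylesheet}",
        f"-threads:{threads}",
    ]


def jing_command(schema, xml_files, jar="jing/bin/jing.jar"):
    """Build the Jing command line, mirroring the flags in the example above."""
    return ["java", "-jar", jar, "-t", schema, *xml_files]


def run_pipeline(source_dir="ead_brbl", prepped_dir="ead-prepped"):
    """Transform the exported files, then validate the results (hypothetical helper)."""
    subprocess.run(
        saxon_command(source_dir, prepped_dir, "yale.aspace_v2_to_yale_ead3.xsl"),
        check=True,  # stop here if the transformation itself fails
    )
    # No shell is involved, so the *.xml wildcard has to be expanded in Python.
    xml_files = sorted(glob.glob(f"{prepped_dir}/*.xml"))
    result = subprocess.run(jing_command("ead3.xsd", xml_files))
    return result.returncode == 0  # assumption: non-zero means validation errors
```

Building the commands as lists keeps them easy to inspect or log before anything is executed. With a very large number of files, the validation call may need to be split into batches to stay under the operating system's command-length limit.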

To control the property order of the resulting JSON, the bulk transformation process currently uses Saxon-EE through oXygen XML Editor, but that part could be swapped out easily enough once the order no longer matters. For testing, though, it was helpful to ensure that the output matched the order of the sample documents. In any event, converting the files with oXygen took about 75 minutes when using the JSON Lines output, and closer to 160 minutes when exporting one JSON record per file. My first pass at validating the JSON files, however, took about 4 hours when cycling through the JSON Lines output, but only 1 hour when every record had its own file. I suspect that comes down to other factors, but I am noting that my initial attempt took longer overall with the JSON Lines approach (roughly 5 hours 15 minutes versus 3 hours 40 minutes) because of my validation step, which needs to be refined. My first attempt at running the JSON Lines validation is provided in the validate.py file.
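For reference, a streaming check over JSON Lines output might look something like the sketch below. It is not the contents of validate.py. The function name and the required_keys values are placeholders invented for illustration, and the sketch only confirms that each line parses as a JSON object containing the expected top-level keys; it does not validate against a real schema.

```python
import json


def validate_json_lines(lines, required_keys=("id", "title")):
    """Yield (line_number, message) for every problem found.

    `lines` can be any iterable of strings, including an open file handle,
    so records are checked one at a time without loading the whole file.
    The default `required_keys` are placeholders, not the real LUX fields.
    """
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            yield number, f"invalid JSON: {error.msg}"
            continue
        if not isinstance(record, dict):
            yield number, "record is not a JSON object"
            continue
        missing = [key for key in required_keys if key not in record]
        if missing:
            yield number, "missing keys: " + ", ".join(missing)
```

Passing an open file handle (for example, one opened with encoding="utf-8") reads one record per iteration, which avoids holding the full JSON Lines file in memory while it is being checked.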