marklogic-community/Corona

Normalize dates in extracted metadata from binaries

ryangrimm opened this issue · 3 comments

Many binary formats include a creation date or a modification date in their metadata. These dates appear under different names in different file formats. To support queries and range indexes on this metadata, the dates need to be normalized into xs:dateTime values.

To do so, the current plan is to attempt to parse the value of any piece of metadata that has "date" or "time" in its name. The parsing can be done with the date parser that's already in use, and new formats can easily be added to it if need be.
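As a rough illustration of that plan, here is a minimal sketch in Python rather than in Corona's own code. Everything in it is an assumption made for illustration: the `FORMATS` list, the function names, and the use of a plain dictionary for the extracted metadata are hypothetical, and `parse_date` is only a stand-in for the existing date parser, whose supported formats are not listed in this issue.

```python
from datetime import datetime
from typing import Dict, Optional

# Hypothetical list of input formats; a stand-in for whatever the
# existing date parser supports. New formats would be appended here.
FORMATS = [
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
    "%m/%d/%Y",
]


def parse_date(value: str) -> Optional[str]:
    """Return an xs:dateTime-style string, or None if no format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).isoformat()
        except ValueError:
            continue
    return None


def looks_like_date_field(name: str) -> bool:
    """True when the metadata name contains "date" or "time"."""
    lowered = name.lower()
    return "date" in lowered or "time" in lowered


def normalize_dates(metadata: Dict[str, str]) -> Dict[str, str]:
    """Map each date-like field name to its normalized value.

    Fields whose values cannot be parsed are left out of the result.
    """
    normalized = {}
    for name, value in metadata.items():
        if looks_like_date_field(name):
            parsed = parse_date(value)
            if parsed is not None:
                normalized[name] = parsed
    return normalized
```

A substring match on the name will also pick up unrelated fields (for example a name containing "candidate"), but in this sketch that is harmless because values that fail to parse are simply skipped.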

We will want to normalize the element names as well. Content extracted from PDF ends up with corona:modDate while content extracted from Word ends up with corona:lastSavedDate (which I believe are conceptually the same thing). I did a quick inventory of a half dozen other formats, and that's the main inconsistency I saw.

Now normalizing last-modification metadata to a corona:modDate element. Also running any piece of metadata that has "date" in its name through the date parser. If a date is extracted, it's stored in a normalized-date attribute.
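The behavior described above could look roughly like the following Python sketch, which is not Corona's actual code. The `ALIASES` table, the helper names, and the single format accepted by `parse_date` are all hypothetical; only the `lastSavedDate` to `modDate` mapping and the `normalized-date` attribute name come from this thread. Elements are handled with the standard `xml.etree.ElementTree` module, which writes namespaced tags as `{uri}local`.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from typing import Optional, Tuple

# Hypothetical alias table: format-specific names mapped to the
# normalized element name. Only this one pair is mentioned in the issue.
ALIASES = {"lastSavedDate": "modDate"}


def parse_date(value: str) -> Optional[str]:
    """Stand-in for the existing date parser; accepts one format only."""
    try:
        return datetime.strptime(value.strip(), "%Y-%m-%d %H:%M:%S").isoformat()
    except ValueError:
        return None


def split_tag(tag: str) -> Tuple[str, str]:
    """Split an ElementTree tag of the form {uri}local into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag


def normalize_element(elem: ET.Element) -> ET.Element:
    """Rename aliased elements and attach normalized-date when parseable."""
    uri, local = split_tag(elem.tag)
    local = ALIASES.get(local, local)
    elem.tag = f"{{{uri}}}{local}" if uri else local
    if "date" in local.lower() and elem.text:
        parsed = parse_date(elem.text)
        if parsed is not None:
            # The original text is kept; the normalized value is stored
            # alongside it as an attribute.
            elem.set("normalized-date", parsed)
    return elem
```

Keeping the original text and adding the normalized value as an attribute means nothing is lost when a value parses incorrectly, and elements whose values do not parse are left unchanged apart from the rename.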