keeps/commons-ip

Cannot create SIP with ID containing non-filesystem-safe characters

prettybits opened this issue · 4 comments

When trying to create a SIP for an object that has a DOI as its identifier, and whose ID therefore contains slashes, the CLI exits with an error:

$ java -jar commons-ip2-cli-2.2.1.jar create -sid 10.33510/nls.js.AC12345678 -rd media -mf MODS.xml
ERROR

Can't create the sip

For quick testing, I adjusted the relevant points in the code to replace slashes with underscores when determining the zip file and root folder names. That worked, but led to a whole host of follow-on validation errors for the generated SIP, because validation compares paths in the package against paths generated from the unescaped identifier.
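To illustrate the mismatch, here is a minimal Java sketch. It is not the commons-ip code: `toFolderName` and `rootFolderMatchesId` are hypothetical helpers made up for this example, and the check is only a stand-in for whatever comparison the validator really performs. It shows that the raw identifier resolves to nested path elements, and that a folder named after the sanitized identifier no longer equals the raw one.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class SlashReplacementSketch {
    // Hypothetical helper: the quick workaround of swapping '/' for '_'.
    static String toFolderName(String sipId) {
        return sipId.replace('/', '_');
    }

    // Hypothetical stand-in for a validation step that compares the
    // root folder name with the raw (unescaped) identifier.
    static boolean rootFolderMatchesId(Path rootFolder, String sipId) {
        return rootFolderMatchesName(rootFolder.getFileName().toString(), sipId);
    }

    static boolean rootFolderMatchesName(String folderName, String sipId) {
        return folderName.equals(sipId);
    }

    public static void main(String[] args) {
        String sipId = "10.33510/nls.js.AC12345678";

        // The raw ID is split at the slash: "out", "10.33510", "nls.js.AC12345678".
        Path raw = Paths.get("out", sipId);
        System.out.println(raw.getNameCount()); // 3

        // The sanitized ID gives a single folder, but it no longer equals the ID.
        Path sanitized = Paths.get("out", toFolderName(sipId));
        System.out.println(sanitized.getNameCount());              // 2
        System.out.println(rootFolderMatchesId(sanitized, sipId)); // false
    }
}
```

Any check written along these lines has to apply the same escaping to the identifier before comparing, or skip the comparison when the names legitimately differ.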

As far as I could find, there don't seem to be explicit restrictions on the characters allowed in the SIP ID / OBJID, so I think using a DOI should be possible here. Successful validation of CSIPSTR2 will not be possible when the identifier is escaped to filesystem-safe characters, but other validations involving paths inside the package should still succeed and should not involve the original identifier.

@AntonioG70 @luis100

I see you started to tackle this issue in #131 and #134, thanks for that! (You accidentally left the backslash in as an allowed character; I added a comment there as well.)

Of the places I found where this issue strikes, two are still left: checkRootFolderName in ZipManager.java and in FolderManager.java. Without adjusting them, validation will fail for CSIP1.

This needs to be addressed in the specification itself, and more comprehensively, to ensure compatibility with different operating systems. I've created an issue in the specification's GitHub repository: DILCISBoard/E-ARK-CSIP#700

A reference for illegal characters and reserved file names is here, but as the following post notes, it is going to be impossible to anticipate every issue.
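As a rough illustration of what such a check involves, the following Java sketch tests a candidate name against the well-known Windows restrictions: the characters `< > : " / \ | ? *`, control characters, a trailing dot or space, and the reserved device names. The class and method names are hypothetical, and the list is deliberately not exhaustive, which is exactly the difficulty mentioned above.

```java
import java.util.Locale;
import java.util.Set;

class FileNameSafetySketch {
    // Characters that Windows rejects in file names.
    private static final String ILLEGAL_CHARS = "<>:\"/\\|?*";

    // Reserved Windows device names (also reserved when followed by an extension).
    private static final Set<String> RESERVED_NAMES = Set.of(
        "CON", "PRN", "AUX", "NUL",
        "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
        "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9");

    // Returns true if the name passes this (non-exhaustive) set of checks.
    static boolean isPortableFileName(String name) {
        if (name.isEmpty() || name.endsWith(".") || name.endsWith(" ")) {
            return false;
        }
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c < 0x20 || ILLEGAL_CHARS.indexOf(c) >= 0) {
                return false;
            }
        }
        // Compare the part before the first dot against the reserved names.
        int dot = name.indexOf('.');
        String stem = (dot >= 0 ? name.substring(0, dot) : name).toUpperCase(Locale.ROOT);
        return !RESERVED_NAMES.contains(stem);
    }

    public static void main(String[] args) {
        System.out.println(isPortableFileName("10.33510/nls.js.AC12345678"));   // false
        System.out.println(isPortableFileName("10.33510%2Fnls.js.AC12345678")); // true
        System.out.println(isPortableFileName("nul.txt"));                      // false
    }
}
```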

Although we have an approach to solve this, and have implemented it, creating SIPs with object IDs that would end up being URL encoded might produce E-ARK SIPs that do not pass validation on validators other than commons-ip. The specification needs to be updated to define how these cases should be dealt with. Such a specification change request was submitted at DILCISBoard/E-ARK-CSIP#700; please reinforce it by voting it up.
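For readers unfamiliar with the idea, URL encoding an identifier into a folder name could look roughly like the Java sketch below. This is an assumption about the general technique, not the code in commons-ip: the class and method names are hypothetical, and it relies on the standard `java.net.URLEncoder` and `URLDecoder` overloads that take a `Charset` (Java 10 or later). Because `URLEncoder` implements form encoding, the sketch additionally rewrites `+` (used for spaces) and `*` (left unescaped) as percent escapes.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SipIdEncodingSketch {
    // Hypothetical helper: turns an identifier into a single path element.
    static String encodeForFileSystem(String sipId) {
        return URLEncoder.encode(sipId, StandardCharsets.UTF_8)
            .replace("+", "%20")   // form encoding writes spaces as '+'
            .replace("*", "%2A");  // '*' is not escaped by URLEncoder
    }

    // Hypothetical helper: recovers the identifier from the folder name.
    static String decodeFromFileSystem(String folderName) {
        return URLDecoder.decode(folderName, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sipId = "10.33510/nls.js.AC12345678";
        String folder = encodeForFileSystem(sipId);
        System.out.println(folder);                       // 10.33510%2Fnls.js.AC12345678
        System.out.println(decodeFromFileSystem(folder)); // 10.33510/nls.js.AC12345678
    }
}
```

Unlike replacing slashes with underscores, this mapping is reversible, so a validator that knows the convention can decode the folder name before comparing it with the OBJID. A validator that does not know the convention will still see a mismatch, which is why the specification needs to settle the question.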

Dear @luis100, as suggested I upvoted the specification issue you opened and also added a comment of my own, thanks for opening it!

You say you implemented a fix for this by percent-encoding the folder names, but I don't see any corresponding change in the commons-ip code here. Both points from my previous comment still apply: backslashes are still allowed, and two locations still need adjustment, without which validation for CSIP1 will fail. I believe it shouldn't fail in cases of OBJID<->folder name mismatches; see my comment in the spec issue.