ropensci/EML

Consider adding a warning to `write_eml` generates random `packageId` and `system` values

Opened this issue · 3 comments

Stemming from #292,

Right now, when an eml object is written to disk with write_eml without a packageId, write_eml fills in packageId with a UUID and sets system to uuid:

EML/R/write_eml.R

Lines 27 to 31 in 6c2911c

## FIXME, insist on a packageId, or consider validation which can ignore id?
if (is.null(eml$packageId)) {
eml$packageId <- uuid::UUIDgenerate()
eml$system <- "uuid"
}

In #292, this created some understandable confusion for @scelmendorf because it meant that eml_validate's behavior was inconsistent depending on whether the object was written to disk first or not.

In #292, @cboettig wrote:

I think the best fix would be that write_eml should throw a warning if no packageId is found, and explain that it is adding an ID automatically. I'd love a PR for that if we have consensus on that.

Also in #292, @mobb wrote:

The temporary packageId and system are fine (as inserted by write_eml). However, users should be aware that this is happening, so I recommend a message to the screen if write_eml inserts these. For those who control their packageIds, the msg will help remind them to assign it.

I agree with the above and it seems we have a consensus but I wanted to file a standalone issue for discussion in case it was needed. Open to comments or suggested wording of the warnings/messages.

I propose:

  1. If packageId is not set when write_eml is called, issue a warning.
  2. The warning will have the text: "packageId generated automatically because it was not already set. See ?packageId for more information."
  3. Create a man page at ?packageId with more information. TBD. It'd explain how to set packageId and system and how to come up with sensible values.

While I fully appreciate the motivation for providing a packageId automatically, I am still uncomfortable with doing so in write_eml, which is meant to simply be a serialization of an in-memory object. Side-effects that aren't in user control often cause problems, and I think what is written to disk should be round-trippable with the in-memory object.

Instead of changing content, couldn't we simply provide warnings for any validation errors (including a missing packageId) so the user will know that their document doesn't validate when it is written to disk? This would allow folks that want to export a partial document for further downstream processing to do so, while still helping to prompt when required fields are missing.

@mbjones Yeah, I'm with you 💯 on that.

I think we should also add a helper routine to add_packageId() that could default to the current UUID method to avoid the possible complexity associated with this task. That would be far more transparent than doing it on write.

Perhaps we can also add some additional logic into the validator which would report that the EML was valid except for a missing packageId (possibly with a suggestion on how to add one?, e.g.)

EML file is valid except for a missing packageId. You can create one now with `add_packageId()`

I've always been a bit annoyed at the standard XML validation errors not being as user-friendly or comprehensive as they should be.

Thoughts?

Sounds great!