ArctosDB/arctos

Feature Request - control what identifiers an agent may issue

Opened this issue · 9 comments

Is your feature request related to a problem? Please describe.

See #7808 (comment)

"Collection agents" (those who wear a collectionID) are issuing all sorts of identifiers. Most of this does not seem realistic to me. Probably other agents are also being recorded as issuing identifiers which they did not actually issue.

Describe what you're trying to accomplish

  1. Clean data.
  2. Make it easy for users to understand how to do things.

Describe the solution you'd like

Some mechanism to control what identifiers certain agents may issue.

Describe alternatives you've considered

Do nothing.

Additional context

I can dig around in the data if there's any interest in proceeding with this.

Priority

Not clear. Preventing future tangled messes seems important to me, and this has already caused some problems (#7025), but both actionable (e.g., URLs) and local (e.g., NK) identifiers can mostly function without this information, so filtering is possibly of relatively little value.

I think this probably makes sense. It would be good to see an updated version of what the problems currently are.

updated version

#7808 will make it much easier to get what caught my eye, but I think it's probably not possible to know what the problems are until they become problematic. It's hard to imagine that random agents doing random stuff won't find a way to get weird, but IDK if that ==> "problem."

OK, dug around a bit, yea mess.

https://arctos.database.museum/search.cfm?id_issuedby=%3DWoodland%20Park%20Zoo&oidtype=collector%20number - collector number "refers to a person's field catalog", so that's just wrong (but possibly/hopefully in some way which doesn't much change functionality).

https://arctos.database.museum/search.cfm?id_issuedby=%3DBeverly+J.+Witte&oidtype=field+number - don't think VP uses 'unique number assigned to a collecting event'

Lots of #7836 messes (people are not institutions....)

Here's some data; maybe it'll lead somewhere. This seems to be enough to understand that the intersection of type and agent is at least sometimes arbitrary, which seems a bit sub-optimal to me:

https://docs.google.com/spreadsheets/d/1jdC08vXtbdNhVXDIUz2qwx8ZDkLwds0VpWr4mZrTvXk/edit#gid=714618271

From that I noticed 500+ malformed (==broken) GenBank links that are attributed to the correct agent (so controllable/detectable):

arctosprod@arctos>> select  count(*)   from coll_obj_other_id_num where issued_by_agent_id=21349032 and substr(display_value,0,37) != 'http://www.ncbi.nlm.nih.gov/nuccore/' ;
 count 
-------
   571

Some of them don't seem to be that at all, BUT NCBI seems to have a pretty good 404 handler - I guessed that SRR19593543 (https://arctos.database.museum/search.cfm?id_issuedby=%3DNCBI%20Nucleotide%20(GenBank)&oidnum=%3DSRR19593543) should be http://www.ncbi.nlm.nih.gov/nuccore/SRR19593543 (it should not) and got magicked to https://www.ncbi.nlm.nih.gov/sra/SRR19593543 (which should have been issued by https://arctos.database.museum/agent/21349034, not https://arctos.database.museum/agent/21349032).
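
The prefix test in the query above, plus the SRR case, could be sketched as a small classifier. This is only an illustration: the function name and the three labels are made up, the stored display_value for the SRR record is not shown in this thread, and the assumption that accessions of the form SRR/ERR/DRR plus digits are SRA runs comes from general INSDC naming, not from anything in Arctos.

```python
import re

# Prefix used in the query above.
NUCCORE_PREFIX = "http://www.ncbi.nlm.nih.gov/nuccore/"

# Agent IDs as mentioned in this thread.
GENBANK_AGENT_ID = 21349032  # agent used in the query above
SRA_AGENT_ID = 21349034      # agent the SRR link should have been issued by

# Assumption: SRA run accessions look like SRR/ERR/DRR followed by digits.
SRA_RUN = re.compile(r"^[SED]RR\d+$")


def classify_genbank_value(display_value: str) -> str:
    """Hypothetical helper: label one value attributed to GENBANK_AGENT_ID."""
    if display_value.startswith(NUCCORE_PREFIX):
        return "ok"
    # Take the last path segment so bare accessions and other URLs both work.
    accession = display_value.rstrip("/").rsplit("/", 1)[-1]
    if SRA_RUN.match(accession):
        return "wrong_agent_sra"  # probably belongs to SRA_AGENT_ID instead
    return "malformed"
```

Under those assumptions, `classify_genbank_value("SRR19593543")` returns `"wrong_agent_sra"`, which separates "right agent, broken link" from "wrong agent entirely" before anyone attempts a cleanup.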

So yea, clearly a problem, at least partially involving our inability to complete the migration. I'll bump priority.

collector number "refers to a person's field catalog"

That's how we use it here at the UWYMV, and since it refers to a person's individual catalog I would think you would want an Agent to be the one associated with that collector number. The list you provided seemed to have a lot of examples of how I thought the new system was supposed to be used?

lot of examples of how I thought the new system was supposed to be used?

Yup, most of the list seems to be just fine, and yay us for that. There are no filters on my query; that's just every agent who's issued anything and the types they've issued, a first-step exploratory view.

Oh phew. I thought things were changing again.

mkoo commented

Thanks-- I took a look at the google spreadsheet. It's helpful but also seems to include a lot of legit data too. Here's one way to view the data:
Organizations vs People

In Orgs, these types (identifier and institutional catalog number) are redundant so merge/ cleanup seems simple.

For people, many have both collector number and preparator number. That's a curatorial distinction that I foresee we are not going to be able to resolve here in the present day. Maybe in the future. Ideally we'd just call it catalog number or personal catalog or something else, since obviously it's usually the same series of numbers-- it's just traditional to carry a little more distinction between someone who just prepped the specimen vs. someone who did field work (thus, do you bother looking for a field journal or not?). We do have better ways to distinguish, of course (agent roles), but the identifier_type? Not the best place for that! But sure, I predict a fight to retain this ancient methodology.

So maybe we limit people to just two types moving forward,
and propose a clean-up to remove the low-frequency ones like processing number, field number, etc. I suggest the CT committee review this tomorrow.
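
One way to picture the Organizations vs People split is an allow-list keyed on agent type. A rough sketch only: the agent-type labels, the table contents, and the function name are all hypothetical, and which of the two redundant organization types survives a merge is an assumption here, not a decision.

```python
# Hypothetical allow-list: which identifier types each kind of agent may issue.
ALLOWED_ID_TYPES = {
    # People limited to the two types suggested above.
    "person": {"collector number", "preparator number"},
    # Assumption: "institutional catalog number" gets merged into "identifier".
    "organization": {"identifier"},
}


def may_issue(agent_type: str, id_type: str) -> bool:
    """Return True if this kind of agent is allowed to issue this identifier type."""
    return id_type in ALLOWED_ID_TYPES.get(agent_type, set())
```

With that table, `may_issue("person", "field number")` is false while `may_issue("person", "collector number")` is true. The check is only as good as the agent types behind it, so non-people entered as person agents would still slip through.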

lot of legit data

Yes, #7837 (comment)

(identifier and institutional catalog number) are redundant

Yes, these are used completely interchangeably/arbitrarily, see also #7836, cleaning up any bit of this mess brings clarity to the rest.

people, many have both collector number and preparator number.

Yep, no problem.

field number

Yea, I have no idea what to do about that. 99.9999% of the usage outside fish collections is just wrong, but I'd not want to try to write code to catch that either. Call it small fry and ignore for now...

Organizations vs People

Interesting, and that would catch much of the most-obvious "this can't possibly be right...." usage. BUT....

See eg #7649, there are a ton of clearly-not-people agents entered as people, our not-great data kinda always contaminates something else...

fish collections ... small fry

I see what you did there....

#7649 (comment) + #7836 ==> https://arctos.database.museum/guid/MSB:Mamm:145728

[Screenshot, 2024-06-27 14:30:34]

A low-quality person-agent acting as an institution doesn't seem optimal.