UCM Agent Bulkload Request
Closed this issue · 18 comments
cf_temp_pre_bulk_agent_download_version ready.csv
Please bulkload the agents in the attached file.
Note: The file should be the results from the Agent Prebulkload Tool. If the file is too large for GitHub attachments, comment here and an email address or shared file space will be provided to you.
S-C Lee
C-C Chen
C-P Chen
J-T Chao
and others with a dash in preferred name. First names should not include punctuation other than a period. Are we sure these are people and should they be:
S. C. Lee
C. C. Chen
C. P. Chen
J. T. Chao
?
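A rough sketch of how names like these could be flagged in the sheet before upload. It only looks at the name strings themselves; the function name and the sample list are hypothetical, and whether the dotted form is actually correct for a given person still needs a human to decide.

```python
import re

# Matches hyphen-joined single-letter initials such as "S-C" or "J-T"
# at the start of a name. Full hyphenated given names ("Chin-Tsong")
# are deliberately not matched.
HYPHENATED_INITIALS = re.compile(r"^[A-Z]-[A-Z]\b")


def suggest_dotted_initials(name):
    """Return 'S. C. Lee' for 'S-C Lee', or None if the pattern does not apply."""
    match = HYPHENATED_INITIALS.match(name)
    if match is None:
        return None
    first, second = match.group(0).split("-")
    return f"{first}. {second}.{name[match.end():]}"


# Sample names taken from this thread (illustrative only).
names = ["S-C Lee", "C-P Chen", "Chin-Tsong Lo", "R/V Soyo-Maru"]
flagged = {}
for n in names:
    suggestion = suggest_dotted_initials(n)
    if suggestion is not None:
        flagged[n] = suggestion
```

Here `flagged` would hold the two initial-only names with a suggested replacement each, leaving the hyphenated given name and the vessel name alone.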
R/V Soyo-Maru
Is not a person, but a research vessel? If so, this may be added as an organization.
A. C. Burrill
R. C. Burrill
This really feels like someone somewhere mis-transcribed an A for an R or the other way around?
W. F. Halliday
W. R. Halliday
Ditto for the F and R in these two
D. A. Han
K. A. Han
And the D and K here
Mr. A. E. Collins
Mrs. A. E. Collins
Add the "spouse of" relationship between these after they are added?
Will Eberle-Taylor
Nick Eberle-Taylor
Quinn Eberle-Taylor
I assume these people are related? Do we know how?
Can I be convinced that these are really not the same person?
William W. Hay
W. Hay
Or these two?
Norman E. A. Hinds
Norman E. C. Hinds
All of the "not the same as" relationships require a method and determiner.
I am not trying to be obstructionist, but it seems like there is still some cleanup that could be done before we add these agents? I stopped looking at the near matches, so there are probably others I would add to the categories above.
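For the single-letter cases above (A/R, F/R, A/C), a minimal sketch of an in-house check that could be run on the list before comparing against Arctos. It is not what the Arctos checker does; it only pairs up names of equal length that differ in exactly one character, so it will not catch cases like William W. Hay vs. W. Hay.

```python
from itertools import combinations


def differs_by_one_char(a, b):
    """True when two equal-length strings differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def near_match_pairs(names):
    """Return every pair of distinct names that differ by a single character."""
    unique = sorted(set(names))
    return [(a, b) for a, b in combinations(unique, 2) if differs_by_one_char(a, b)]


# Sample names taken from this thread (illustrative only).
names = [
    "A. C. Burrill", "R. C. Burrill",
    "W. F. Halliday", "W. R. Halliday",
    "Norman E. A. Hinds", "Norman E. C. Hinds",
    "William W. Hay", "W. Hay",
]
pairs = near_match_pairs(names)
```

Each pair returned is only a candidate for review; two names that differ by one initial may well be two different people.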
No worries. Thanks for catching those. Updated agent list attached:
cf_temp_pre_bulk_agent_download_final version.csv
@dustymc Thanks for including me in the #7649 issue. Maybe we should pare our list down so that the only agents that get uploaded are ones that have full names (i.e., no initial) or have one (or more) attribute that distinguishes them (makes them unique) from other agents? So, for instance if we have a J. Smith the only way we can upload that person as an agent is if we had an attribute, say "child of", linked to that agent. Would that work?
So, for instance if we have a J. Smith the only way we can upload that person as an agent is if we had an attribute, say "child of", linked to that agent. Would that work?
That will help, but the ones I am struggling with include things like
Barbara Waleis which feels like it may be a mistranscription of Barbara T. Waters
Charles A. Nelson feels like a mistranscription of Charles D. Nelson (or perhaps it is the other way around, A and D can look very similar when written or maybe these ARE two different people, but I have no way to decide that)
Chin-Tsong Lewis and Chin-Tsong Lo - one of these must be a misspelling, an alternate name for the same person, or are they related people?
You may have no way to figure out if my "feelings" are justified, but if you do, it might be good to get things like this sorted before making agents.
As before, I did not peruse the entire list to look for these internal issues, but there are probably others! Do not take this as a summary of everything that I think needs review - just ideas for looking at the data you have in-house even before comparisons to Arctos agents.
Barbara Waleis which feels like it may be a mistranscription of Barbara T. Waters
I can confirm that Barbara Waleis and Barbara T. Waters are two different people. Waleis is a collector from the 1930s, while Waters is a collector from the 1980s.
The others are all agents for the invert zoo collection, which will need to be checked by @Krmartin3 when she gets back from vacation. I can say that the Chinese do use hyphenated first names. So, Arctos may need to figure that one out, but I'll let Kelly chime in when she is back.
Charles A. Nelson feels like a mistranscription of Charles D. Nelson (or perhaps it is the other way around, A and D can look very similar when written or maybe these ARE two different people, but I have no way to decide that)
Chin-Tsong Lewis and Chin-Tsong Lo - one of these must be a misspelling, an alternate name for the same person, or are they related people?
In the meantime, I'm going to pull all of invert zoo's agents from the sheet, as I think most of the issues are coming from that side (sorry Kelly). I'll reupload a new sheet of agents here in a bit.
@Jegelewicz new list of agents attached
cf_temp_pre_bulk_agent_vert paleo agents only.csv
@javanveldhuizen the dates in that CSV have been mangled (probably by Excel?).
@dustymc Interesting, the dates look fine on my end.
Should I use a different program to edit the CSV instead?
@dustymc Ok. I edited the CSV using Notepad and changed all the dates into the desired format: yyyy-mm-dd. Let me know if that doesn't work.
look fine
Yea, but they don't SAVE fine (eg unambiguously), which is why we require CSV.
https://handbook.arctosdb.org/how_to/How-to-Excel-for-Arctos.html#dates (I wrote the 'eat your data' bits but not the niceties at the top!)
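A minimal sketch of a check that could be run on the saved CSV (not the spreadsheet view) to confirm every date really is yyyy-mm-dd. The column name `begin_date` and the sample rows are made up for illustration; substitute whatever date columns the file actually has.

```python
import csv
import io
from datetime import datetime


def is_iso_date(value):
    """True only for a real calendar date written exactly as yyyy-mm-dd."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    # Round-trip so that unpadded forms such as 2024-7-1 are rejected too.
    return parsed.strftime("%Y-%m-%d") == value


def find_bad_dates(csv_text, date_column):
    """Return (row_number, value) for each non-empty cell that is not yyyy-mm-dd."""
    reader = csv.DictReader(io.StringIO(csv_text))
    bad = []
    # Row 1 is the header, so data rows start at 2.
    for row_number, row in enumerate(reader, start=2):
        value = row[date_column]
        if value and not is_iso_date(value):
            bad.append((row_number, value))
    return bad


# Hypothetical sample data; "begin_date" is an assumed column name.
sample = "preferred_name,begin_date\nA,2024-07-01\nB,7/1/2024\nC,\nD,2024-7-1\n"
bad = find_bad_dates(sample, "begin_date")
```

Blank cells are skipped here on the assumption that a missing date is allowed; tighten that if it is not.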
Thanks, I've got those in the pre-loader.
The first thing in my view is "Humboldt Museum" - surely that's https://arctos.database.museum/agent/21336826 or https://arctos.database.museum/agent/21348575??
The first thing in my view is "Humboldt Museum" - surely that's https://arctos.database.museum/agent/21336826 or https://arctos.database.museum/agent/21348575??
It's kind of actually neither of those things. The specimens I have tied to the Humboldt Museum were donated to us from a researcher at the Humboldt-Universität zu Berlin. What's unclear is whether these were actually part of the museum at that university, which later became the Museum fuer Naturkunde der Humboldt-Universitaet Berlin, or if they were a part of a researcher's lab collection. I kept it as Humboldt Museum until I could fully untangle it. Feel free to delete it from the list if you feel that it is not an appropriate true agent.
@dustymc Here is the agent sheet again with the Humboldt Museum removed
cf_temp_pre_bulk_agent_vert paleo agents only.csv
you feel
Ugh, that should not be the path, @ArctosDB/agents-committee HELP!
Lacking further guidance, that seems a somewhat defensible position to me (and a remark would be useful, if that's not already there).
I loaded data to https://docs.google.com/spreadsheets/d/1it7JgDc0Fxnccn5yD_bO6kdYFjPRrbJhqptOVAOu3G8/edit?gid=907589706#gid=907589706
Again an "interesting" situation on the first line!
First your agent will load, then Arctos will run....
arctosprod@arctos>> select getAgentID('David Taylor');
getagentid
------------
21333592
except two results will be returned - this one and the one just created - which will result in an error. Maybe that's somehow my problem, but I'm not quite sure how to address it. https://arctos.database.museum/agent/21333592 will always be unambiguous, but isn't great for humans to work with in a spreadsheet.
Beyond that, I don't know how to proceed. (I'd use verbatim agents as a first pass so we don't have to guess from strings, but I seem to have lost that argument!)
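To illustrate the ambiguity described above: a minimal sketch that flags names in a new sheet which would collide with an existing agent name, so that a name-based lookup could return more than one agent. This is plain exact-string comparison on made-up lists; it does not reproduce how `getAgentID` actually resolves names, and the function name is hypothetical.

```python
from collections import Counter


def ambiguous_names(existing_names, new_names):
    """Return new names that would occur more than once after loading."""
    counts = Counter(existing_names) + Counter(new_names)
    return sorted(name for name in set(new_names) if counts[name] > 1)


# Illustrative lists only; real data would come from Arctos and the CSV.
existing = ["David Taylor", "William Simpson"]
incoming = ["David Taylor", "Bill Simpson"]
collisions = ambiguous_names(existing, incoming)
```

Anything that lands in `collisions` would need to be referenced by agent ID (or loaded some other way) rather than by name string.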
person | Sarah E. Rieboldt | attribute match: first+last variants Sarah Rieboldt person | first name | Sarah | middle name | E. | last name | Rieboldt | not the same as | Sarah Reiboldt | 2024-07-01 | Jacob Van Veldhuizen | dlm |
person | Bill Simpson | attribute match: first+last variants William Simpson person | first name | Bill | last name | Simpson | not the same as | William Simpson | 2024-07-01 | Jacob Van Veldhuizen | dlm |
organization | Brigham Young University Museum of Paleontology | attribute match: aka Brigham Young University Life Science Museum organization | aka | BYU | Wikidata | https://www.wikidata.org/wiki/Q4836911 | not the same as | Brigham Young University Life Science Museum | 2024-07-01 | Jacob Van Veldhuizen | dlm |
look pretty suspicious (and maybe that's OK, I don't know, this should still not be my call @ArctosDB/agents-committee !!)
I didn't scroll very far, just enough to grab a couple examples.
I don't see any super-obvious duplicates or mistyped agents or such in the file. I REALLY don't want this to be my call (see above, I'd do something entirely different!), and the ~30 flagged by the checker could definitely use careful review, but loading this doesn't seem unreasonable.
@Jegelewicz @mkoo thoughts??
arctosprod@arctos>> select getAgentID('David Taylor');
getagentid
------------
21333592
I have deleted David Taylor from my list and will make him a verbatim agent for now until that issue is fixed. I can confirm that the David Taylor already in Arctos is not the same David Taylor in my data.
person Sarah E. Rieboldt attribute match: first+last variants Sarah Rieboldt person first name Sarah middle name E. last name Rieboldt not the same as Sarah Reiboldt 2024-07-01 Jacob Van Veldhuizen dlm
For some reason Sarah Reiboldt keeps reappearing in this list even though I keep deleting it. Anyway, I've deleted it once again and I can confirm that the Sarah Reiboldt already in Arctos is the same Sarah Reiboldt in my data.
person Bill Simpson attribute match: first+last variants William Simpson person first name Bill last name Simpson not the same as William Simpson 2024-07-01 Jacob Van Veldhuizen dlm
The Bill Simpson I have in my data is an amateur collector in the Denver area and not the William Simpson already in Arctos. These are two separate people, as indicated by the "not the same as" attribute.
organization Brigham Young University Museum of Paleontology attribute match: aka Brigham Young University Life Science Museum organization aka BYU Wikidata https://www.wikidata.org/wiki/Q4836911 not the same as Brigham Young University Life Science Museum 2024-07-01 Jacob Van Veldhuizen dlm
The BYU Museum of Paleontology and the BYU Life Science Museum are two different organizations. Here are their websites so you can confirm:
New list here:
cf_temp_pre_bulk_agent_vert paleo agents only.csv
David Taylor
You can also just create the agent manually (where everything involves IDs instead of strings).
as indicated by the "not the same as" attribute
Sorry, I didn't look very carefully (was aiming for general considerations, not specifics!), thanks!
New list
running....
https://docs.google.com/spreadsheets/d/1SBF83EZncUko6u1KkVzbQdhaPGDULnNVNKuSEn6Leak/edit?usp=sharing
I suppose I should just load that??? @mkoo
@javanveldhuizen I found a problem on my end and am rolling a partial load back, but during that I noticed
Ward Scientific
Wards National Science
in these data. Surely those are both duplicates of https://arctos.database.museum/agent/21293521?
@dustymc I deleted those agents. They need some verification. New list here:
cf_temp_pre_bulk_agent_vert paleo agents only.csv
Done and blamed on you @javanveldhuizen
There's one full-duplicate low-data copy of another low-data agent that maybe ought to have something done with it.
agent_id | agent_type | preferred_agent_name | creator | created_date
----------+------------+----------------------+----------------------+----------------------------
21354938 | person | Scott Parker | Jacob Van Veldhuizen | 2024-07-03 14:40:17.101114
21257771 | person | Scott Parker | unknown | 2013-12-16 21:49:31
(2 rows)
and one that errored out