greenelab/lab-website-template

orcid plugin duplicates entries

trife opened this issue · 4 comments

trife commented

Link to your website repo

https://github.com/FieldPheno/fieldpheno.org

Version of Lab Website Template you are using

1.2.0

Description

It looks like the orcid plugin is duplicating (or not merging) entries that have multiple sources. As far as I can tell, the resulting entries don't have titles, so they can be filtered out with a regex (filters="title: [\S\s]+[\S]+"), but this also appears to be an issue on the Greene Lab page (scroll to the bottom).

The entries are merged together by id, and no other field. We do it by id to guarantee we're using a globally unique identifier, unlike other fields which could have name collisions (unlikely but possible). All of the entries at the bottom of that Greene Lab page are unique publications with unique ids, even if all of their other info is blank. I just verified that they are all unique with this code snippet you could paste into the dev tools on the page:

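// Take the trailing id from each citation's details line.
all = $$(".citation-details").map(e=>e.innerText.split("·").at(-1).trim());
// A Set drops repeats, so equal sizes mean every id is unique.
unique = new Set(all);
all.length === unique.size;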
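To illustrate the merge-by-id behavior described above, here is a minimal Python sketch. It is not the template's actual code: `merge_by_id`, the sample DOI, and the field names are assumptions made up for the example. It only shows why two sources with different ids stay separate even when one of them carries no other information.

```python
def merge_by_id(entries):
    """Merge entries that share an id; later non-empty fields overwrite earlier ones."""
    merged = {}
    for entry in entries:
        # Entries are grouped on "id" only; no other field is compared.
        target = merged.setdefault(entry["id"], {})
        for key, value in entry.items():
            if value not in (None, "", []):
                target[key] = value
    return list(merged.values())


# Hypothetical sources: two records for one DOI, plus an id with no other details.
sources = [
    {"id": "doi:10.1234/example", "title": "Example paper"},
    {"id": "doi:10.1234/example", "title": "", "publisher": "Example Press"},
    {"id": "eid:2-s2.0-85044966521"},
]
merged = merge_by_id(sources)
```

The two DOI records collapse into one entry, while the eid record survives as its own entry with nothing but an id, which is what shows up as [no title info] on the page.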

All these publications with [no title info], [no author info], etc. are unfortunately due to limitations of Manubot, which generates the full citations. It just doesn't know how to cite every type of identifier in existence, including for example eid:2-s2.0-85044966521 returned from ORCID.

So, this is behaving as intended. However, it could be argued that the behavior of the cite process should be that, if the source has an id but Manubot can't cite it, it doesn't get included in the output citations. Though I'm not sure. Someone could conceivably want to have that list of just ids there to display somewhere on their site, but not with the citation component where it looks dumb if there's no title/authors/etc. Also, someone could choose to keep some of the Manubot-uncitable ids, and manually fill in their details in sources.yaml as described here.

trife commented

That makes sense. Since most of these empty citations are from objects that also have ids that can be cited by Manubot, one potential solution would be to prioritize citable ids and include any others as alt-ids in the output object. If no citable id is found, the non-citable id is used.
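A rough Python sketch of that suggestion, under stated assumptions: `CITABLE_PREFIXES` is an illustrative guess at which id types Manubot can cite, and `is_citable`, `choose_id`, and the sample DOI are hypothetical names, not part of cite.py.

```python
# Assumed, illustrative set of id prefixes that Manubot can cite.
CITABLE_PREFIXES = {"doi", "pmid", "pmcid", "arxiv", "isbn", "url"}


def is_citable(identifier):
    """Guess citability from the prefix before the first colon."""
    prefix = identifier.split(":", 1)[0].lower()
    return prefix in CITABLE_PREFIXES


def choose_id(ids):
    """Pick the first citable id as the primary id; keep the rest as alt-ids."""
    citable = [i for i in ids if is_citable(i)]
    # Fall back to the first id when none of them is citable.
    primary = citable[0] if citable else ids[0]
    entry = {"id": primary}
    alt_ids = [i for i in ids if i != primary]
    if alt_ids:
        entry["alt-ids"] = alt_ids
    return entry


both = choose_id(["eid:2-s2.0-85044966521", "doi:10.1234/example"])
only_eid = choose_id(["eid:2-s2.0-85044966521"])
```

Here `both` uses the DOI as its id and carries the eid under alt-ids, while `only_eid` keeps the eid as its id because nothing better is available.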

Regardless, the regex solves this for me and since that's a whole separate feature/development, I'll go ahead and close this. Thanks for all your work on the template!

If I understand the suggestion correctly, the citations.yaml output would contain entries that have an id for sources that were citable by Manubot and an alt-id for sources that weren't. I think this might confuse people, and make the cite.py script more hairy. I might just opt for a simple flag at the top of the script like keep_uncitable_ids or something.
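For illustration, a flag like that could work roughly as follows. This is a sketch, not the real cite.py: `build_citations` and `fake_cite` are hypothetical stand-ins, the flag is passed as a parameter rather than set at the top of a script, and it is assumed that an uncitable id surfaces as a `ValueError`.

```python
def build_citations(sources, cite, keep_uncitable_ids=False):
    """Cite each source; drop or keep the ones the citer cannot handle."""
    citations = []
    for source in sources:
        try:
            details = cite(source["id"])
        except ValueError:
            if not keep_uncitable_ids:
                continue  # skip ids that cannot be cited
            details = {}  # keep the bare id with no generated details
        # Fields set by hand on the source take precedence over generated ones.
        citations.append({**details, **source})
    return citations


def fake_cite(identifier):
    """Stand-in for the real citer; pretends eid ids are uncitable."""
    if identifier.startswith("eid:"):
        raise ValueError(f"cannot cite {identifier}")
    return {"title": "Example paper"}


sources = [{"id": "doi:10.1234/example"}, {"id": "eid:2-s2.0-85044966521"}]
dropped = build_citations(sources, fake_cite)
kept = build_citations(sources, fake_cite, keep_uncitable_ids=True)
```

With the flag off, only the DOI entry is returned; with it on, the eid entry is kept as a bare id that could still be filled in by hand.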

Also, if you want to use the list component to filter out undefined items, I believe just filters="title: .+" should be sufficient. Please let me know if that doesn't work. You can see /_plugins/misc.rb data_filter() for how the filtering works (though it might be a bit tough to parse).
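As a quick comparison of the two patterns, here is a small Python check. Python's `re` module is used only as a stand-in: the real filtering happens in Ruby inside data_filter(), and this sketch assumes the pattern is searched against the title's string value, which the thread does not confirm. The sample titles are made up.

```python
import re

OLD_PATTERN = r"[\S\s]+[\S]+"  # the regex from the original report
NEW_PATTERN = r".+"            # the simpler regex suggested above


def passes(pattern, title):
    """Return True when the pattern is found somewhere in the title."""
    return re.search(pattern, title) is not None


titles = ["Example paper", "", "A", "   "]
old_results = [passes(OLD_PATTERN, t) for t in titles]
new_results = [passes(NEW_PATTERN, t) for t in titles]
```

Both patterns accept a normal title and reject an empty one. They differ only on edge cases: the older pattern needs at least two characters ending in a non-whitespace character, so it rejects a one-letter title and a whitespace-only title, whereas `.+` accepts both.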

trife commented

You're definitely right that it would overcomplicate that script. That updated filter works for me; my attempt was based on some quick googling. I think it's best for everyone if I avoid trying to understand how the filtering works 😂