internetarchive/fatcat

ISSN-L matching for JURN index

bnewbold opened this issue · 2 comments

JURN is "An organised links directory for the arts & humanities, listing selected open access or otherwise free ejournals." They list 3000-4000 such journals by name, URL, and category at http://www.jurn.org/directory/, and an additional 800 ecology titles at https://jurnsearch.wordpress.com/titles-indexed-ecology-related/.

It would be great to include these in fatcat (probably via chocula first, though could go direct via API as well), and mark them as open so they will be included in broad IA crawls for preservation. However, JURN doesn't link any persistent identifiers (eg, wikidata QID or ISSN/ISSN-L), which makes it hard to reference them anywhere without duplication.

Some brainstorms of how to go about this:

  • query existing fatcat by either fuzzy title match or URL match, using the "container" metadata dump
  • same as the above, but using Wikidata tooling, eg, OpenRefine
  • query portal.issn.org by title
  • visit each journal homepage and try to parse out an ISSN; verify this ISSN against portal.issn.org
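
For the last idea, a rough sketch of the "parse out an ISSN" step, assuming the homepage text has already been fetched. It looks for hyphenated ISSN-shaped strings and keeps only those whose check digit is valid under the standard ISSN mod-11 scheme. The function names are made up for illustration, and a candidate that passes the checksum would still need to be confirmed against portal.issn.org, since a valid checksum does not prove the ISSN belongs to that journal.

```python
import re

# Hyphenated form only (NNNN-NNNC); bare 8-digit runs give too many false hits.
ISSN_RE = re.compile(r"\b(\d{4})-(\d{3})([\dXx])\b")


def issn_check_digit(first_seven):
    """Compute the ISSN check character for the first seven digits."""
    # Weights run 8, 7, ..., 2 over the seven digits.
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    value = (11 - total % 11) % 11
    return "X" if value == 10 else str(value)


def find_valid_issns(text):
    """Return checksum-valid ISSNs found in text, in order, without repeats."""
    found = []
    for match in ISSN_RE.finditer(text):
        body = match.group(1) + match.group(2)
        check = match.group(3).upper()
        if issn_check_digit(body) != check:
            continue  # ISSN-shaped, but the check digit does not match
        issn = f"{body[:4]}-{body[4:]}{check}"
        if issn not in found:
            found.append(issn)
    return found
```

Requiring the hyphen is a deliberate trade-off: it misses ISSNs printed without one, but avoids matching phone numbers and other digit runs.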

Phu2 commented

I scraped the journal names from the JURN directory website, loaded them into OpenRefine, and ran the reconciliation service against Wikidata. By automatically matching the best candidates and doing some manual matching, I got 990 matches out of 3311 journals. For these I tried to add the Wikidata ID, ISSN, and ISSN-L. Here is the comma-separated file exported from OpenRefine:
jurn-directory-csv.txt
Can you use this for some good?

Hi @Phu2, sorry for the slow reply on this. It is helpful!

I wonder what we could do to increase the match rate, or to confirm that the unmatched journals really have no ISSN. Could we have OpenRefine try to reconcile against the fatcat container list instead of Wikidata? There are JSON dumps here:

https://archive.org/details/fatcat_bulk_exports_2020-08-05

or I could supply a .csv file if you let me know which column fields to include.
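
As an alternative to OpenRefine, here is a rough sketch of matching titles directly against the container dump. It assumes the dump is JSON lines (one container object per line) with a `name` field, and that identifiers such as `issnl` or `wikidata_qid` sit alongside it; those field names, the helper names, and the 0.9 cutoff are all assumptions to check against the actual export. It only uses the standard library (`difflib`), so it is slow on a full dump and is meant to show the idea rather than be the final tool.

```python
import difflib
import json
import re
import unicodedata


def normalize_title(title):
    """Lowercase, strip accents, and collapse punctuation to single spaces."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def load_containers(lines):
    """Index container records (JSON lines) by normalized name."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        name = record.get("name")  # assumed field name
        if name:
            index.setdefault(normalize_title(name), []).append(record)
    return index


def match_title(title, index, cutoff=0.9):
    """Return candidate records: exact normalized match first, then fuzzy."""
    key = normalize_title(title)
    if key in index:
        return index[key]
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[close[0]] if close else []
```

Usage would be something like `index = load_containers(open(path))` followed by `match_title(jurn_title, index)` for each JURN title. A title that returns more than one record is ambiguous and probably needs a manual look or a URL comparison.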