CottageLabs/OpenArticleGauge

Modify Models to allow regex plugins to use Publisher configs

Opened this issue · 5 comments

The OUP plugin is getting out of hand and it would be helpful to move it from a static license based approach to one like the SpringerLink plugin that can check publisher configs for any matching URLs.

The SpringerLink plugin uses the classmethod find_by_journal_url() of models.Publisher, which searches for an exact URL match. A regex plugin needs the reverse: to identify the relevant publishers, we need to ask whether any of the URLs registered in any config are matched by the plugin's regex.
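To make the "reverse" lookup concrete, here is a minimal sketch. It does not use the real models.Publisher API; configs are plain dicts, and publishers_matching, the "journal_urls" key and the example.org URLs are all made up for illustration.

```python
import re

def publishers_matching(plugin_regex, configs):
    """Return the configs that have at least one registered URL matched by plugin_regex.

    `configs` is a hypothetical stand-in for publisher records: a list of
    dicts with a "journal_urls" list of plain URL strings.
    """
    pattern = re.compile(plugin_regex)
    return [
        config for config in configs
        if any(pattern.search(url) for url in config["journal_urls"])
    ]

# Illustrative data only; these are not real registered configs.
configs = [
    {"publisher": "Example A", "journal_urls": ["http://journal-a.example.org"]},
    {"publisher": "Example B", "journal_urls": ["http://journal-b.example.com/home"]},
]
matched = publishers_matching(r"^https?://[^/]+\.example\.org", configs)
```

Whether this belongs on models.Publisher as a new classmethod or inside the plugin is exactly the open question.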

The models.Publisher classmethod all_journal_urls() may be the way to do this? Or do we need to expose another class method which tests for a match to a regex?

I think this would have to be new functionality added to the GSM itself: the ability to execute one or all of the journal_url fields on the publisher reg. form as regexes against incoming article URL-s. We've done it before in crude form with http://idfind.cottagelabs.com/, which identifies identifiers using crowdsourced regexes. What you're describing is basically a subset of that problem: identifying a set of URL-s.

True. At the moment the GSM doesn't seem to use the method that I mentioned. Not sure whether this is deliberate or not. I'm not sure that it matters; I guess the question is where that capability best sits, and how you recognise that something is a regex?

Oh, there'd be no way to recognise a string as a regex; the person submitting it would have to tell you it is one. Normal URL-s are valid regexes too, after all (they just match themselves, although the unescaped dots would act as single-character wildcards).
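A quick check of that point using only the standard re module (the URL is a made-up example):

```python
import re

url = "http://www.example.org/journal"
lookalike = "http://wwwXexampleYorg/journal"  # the two dots replaced by other characters

as_regex = re.compile(url)             # a plain URL compiles fine as a regex
literal = re.compile(re.escape(url))   # re.escape makes the dots literal

matches_itself = as_regex.fullmatch(url) is not None
matches_lookalike = as_regex.fullmatch(lookalike) is not None
literal_matches_lookalike = literal.fullmatch(lookalike) is not None
```

So a plain URL cannot be told apart from a regex by inspection, and treating it as a regex is slightly more permissive than an exact match unless it is escaped first.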

Just leaving this here for future: if this is implemented, it's probably a good idea to validate the regexes on form submission (re.compile them) and, if they are not valid, point the user to the right page & section of the Python docs so they at least have a chance of achieving what they want.
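Something along these lines, as a sketch only. validate_journal_url_regex and DOCS_HINT are hypothetical names, and how this would hook into the actual registration form is not shown.

```python
import re

# Hypothetical hint text; the real form would link to the relevant docs page.
DOCS_HINT = "See the 'Regular Expression Syntax' section of the Python re module documentation."

def validate_journal_url_regex(text):
    """Return (ok, message) for a regex submitted in the journal URL field."""
    try:
        re.compile(text)
    except re.error as exc:
        # Tell the user what went wrong and where to read up on the syntax.
        return False, "Invalid regular expression (%s). %s" % (exc, DOCS_HINT)
    return True, ""

ok_good, msg_good = validate_journal_url_regex(r"^https?://[^/]+\.example\.org/")
ok_bad, msg_bad = validate_journal_url_regex(r"http://example.org/(")  # unbalanced parenthesis
```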

The models.Publisher classmethod all_journal_urls() may be the way to do this? Or do we need to expose another class method which tests for a match to a regex?

This fetches all journal URL-s in all of the data.

The OUP plugin is getting out of hand and it would be helpful to move it from a static license based approach to one like the SpringerLink plugin that can check publisher configs for any matching URLs.

Actually, rereading this part, I'm not sure if this is so... I thought you meant we need to allow people to submit regexes in the journal URL field, so that the system can then match incoming article URL-s against the regexes. And if one of them matches, then its config will be used.

Why would we apply a regex to all the URL-s in configs? I.e. what would 1 of them matching actually mean? It's not an incoming article URL for us to say "well we should use the OUP config", it's just a bunch of other configs.
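For comparison, the reading described in this comment (regexes stored in the journal URL field and matched against incoming article URL-s) might look roughly like this. It is a sketch under assumptions: the "is_regex" flag, the dict layout and find_configs_for_article are all hypothetical, and plain entries are compared by exact match as find_by_journal_url() is described as doing.

```python
import re

def find_configs_for_article(article_url, configs):
    """Return configs whose registered journal URL entries match article_url."""
    hits = []
    for config in configs:
        for entry in config["journal_urls"]:
            if entry["is_regex"]:
                # The submitter flagged this entry as a regex.
                matched = re.search(entry["value"], article_url) is not None
            else:
                # Plain entries keep exact-match behaviour.
                matched = article_url == entry["value"]
            if matched:
                hits.append(config)
                break  # one matching entry per config is enough
    return hits

# Illustrative data only.
configs = [
    {"publisher": "Regex publisher",
     "journal_urls": [{"value": r"^https?://[a-z]+\.example\.org/content/", "is_regex": True}]},
    {"publisher": "Plain publisher",
     "journal_urls": [{"value": "http://other.example.com/journal", "is_regex": False}]},
]
found = find_configs_for_article("http://abc.example.org/content/12/3/45", configs)
```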

So my understanding (possibly mistaken) was that the SpringerLink approach
was to have the plugin collect any relevant configs that match the URL
currently being handled, and then let the plugin do the work of diving down
to the additional page.

So the flow is:

DOI -> URL
URL recognised by SpringerPlugin
SP gathers all configs that match the current URL and uses this to add to
the set of licenses to check for
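As a rough sketch of that last step (not the actual SpringerLink plugin code; licenses_for_url and the dict keys are invented for illustration):

```python
def licenses_for_url(current_url, configs, builtin_licenses):
    """Built-in statements plus those from configs registered for current_url."""
    statements = list(builtin_licenses)
    for config in configs:
        # Exact string comparison, mirroring the exact-match lookup described above.
        if current_url in config["journal_urls"]:
            statements.extend(config["licenses"])
    return statements

# Illustrative data only.
configs = [
    {"journal_urls": ["http://journal.example.org"], "licenses": ["config statement"]},
    {"journal_urls": ["http://elsewhere.example.com"], "licenses": ["unrelated statement"]},
]
to_check = licenses_for_url("http://journal.example.org", configs, ["builtin statement"])
```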

So this doesn't work for the OUP case because we'd need to populate the OUP
config with all possible licenses. What I was thinking was that, with a modified
class method, we could search all configs for those that match the regex we
used to trigger the plugin. As you say, this is a bit backwards now I think
about it.

For OUP the issue is that we're discovering additional license texts and
it's a pain to add them to the plugin. In this case updating the GSM to
enable matching based on a regex in the config would solve the problem: we
wouldn't need the OUP plugin any more at all.

There are other cases where we need both a regex to identify the URLs
and the license is not on the dereferenced page but on some subsidiary
page. In this case we still need a plugin, but it's useful to have a config to
make it easier to add new license statements. One option to solve this would
be to hard code a config for the plugin, i.e. have the plugin load a specific
config to which we can add license statements. This seems fragile.

An alternative requires a way for a plugin to determine which configs are
relevant that doesn't rely on a perfect match of the URL. One way would be
to allow a regex in the config URL field and to search for that (i.e. is the
regex that identifies the plugin present in the config? We could actually
use the existing Publisher.all_journal_urls class method to do this, as we
currently do for SpringerLink). This is still 'hard coding' the config in
some ways.

So this would go:
DOI -> URL
URL matches plugin regex
Search configs for 'URLs' that match the regex (either as a string or
compiled should work?), gather license statements
Follow plugin logic to obtain the correct page with the license statement
Match against the gathered statements
Return license

In this case the config URL is not used to match against the page URL but as
a key for the plugin to recognise which configs are the ones to gather.
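To make that flow a bit less muddy, here is a sketch of it with the plugin's regex string acting as the key. Everything named here is hypothetical (PLUGIN_URL_REGEX, the dict layout, fetch_license_page); a real plugin would fetch the subsidiary page over HTTP and read configs through models.Publisher.

```python
import re

# Hypothetical regex that triggers the plugin; also stored verbatim in the
# URL field of any config the plugin should pick up.
PLUGIN_URL_REGEX = r"^https?://[^/]+\.example\.org/"

def gather_license_statements(plugin_regex, configs):
    """Collect statements from configs whose URL field holds the plugin's regex string."""
    statements = []
    for config in configs:
        if plugin_regex in config["journal_urls"]:  # string comparison, used as a key
            statements.extend(config["licenses"])
    return statements

def detect_license(article_url, fetch_license_page, configs):
    """Follow the flow: URL matches regex, gather statements, fetch page, match, return."""
    if not re.search(PLUGIN_URL_REGEX, article_url):
        return None
    statements = gather_license_statements(PLUGIN_URL_REGEX, configs)
    page_text = fetch_license_page(article_url)  # plugin logic for finding the right page
    for statement in statements:
        if statement["text"] in page_text:
            return statement["license"]
    return None

# Illustrative data and a fake page fetcher.
configs = [{
    "journal_urls": [PLUGIN_URL_REGEX],
    "licenses": [{"text": "distributed under the Example Licence", "license": "example-licence"}],
}]
fake_fetch = lambda url: "This article is distributed under the Example Licence."
result = detect_license("http://abc.example.org/content/1", fake_fetch, configs)
```

New license statements then only need adding to the config, not to the plugin code.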

Clear as mud?

From: Emanuil Tolev notifications@github.com
Date: Monday, 2 June 2014 09:10
To: CottageLabs/OpenArticleGauge OpenArticleGauge@noreply.github.com
Cc: Cameron Neylon cn@cameronneylon.net
Subject: Re: [OpenArticleGauge] Modify Models to allow regex plugins to use Publisher configs (#88)
