CottageLabs/OpenArticleGauge

Describe process of analysing a dataset and enhancing results for that dataset


It would probably be good to publish what Cameron wrote in an email on 28 May 2014 as a page in the docs, and to link to it in a sensible manner:

  1. Analyse the set of DOIs that are of interest to you. There is something which is not quite an API client reference implementation available at https://github.com/cameronneylon/apcs (pyoag.py), and if you’re using IPython Notebooks you can see how I’ve used it. This should give you a sense of how successful OAG currently is at giving you results; I would guess that for your set you might see a 10-15% success rate.
  2. Analyse which journals/publishers have the worst return. Here, if you can, you would probably want to look at datasets that should be enriched with articles that ought to have a license associated with them. You’ll probably find gaps as you move away from biomedical sciences (where we’ve focussed) into physics and chemistry, and it will probably be worst for non-English journals in social sciences and humanities.
  3. Identify important cases where we can improve the coverage for journals that are of interest. This might be as simple as aggregating by journal those articles where you don’t get a license and seeing which journals have the most. Obtain license statements that can be added to the library (http://oag.cottagelabs.com/publisher/list).
  4. If desired, help to write plugins for edge cases that are a problem. One large-scale issue is those publishers (T&F for instance) where it is necessary to dig into the actual PDF to find license information.
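
For the docs page, the bookkeeping in steps 1 and 3 could be illustrated with a rough sketch like the one below. It assumes the lookup results have already been collected into a list of dicts with `doi`, `journal` and `license` keys. That record layout, the function names and the sample data are all made up for illustration; this is not the pyoag.py interface or the OAG response format.

```python
from collections import Counter


def success_rate(records):
    """Return the fraction of records that came back with a license."""
    if not records:
        return 0.0
    found = sum(1 for r in records if r.get("license"))
    return found / len(records)


def missing_by_journal(records):
    """Count records without a license per journal, worst journal first."""
    counts = Counter(
        r.get("journal") or "(unknown journal)"
        for r in records
        if not r.get("license")
    )
    return counts.most_common()


# Hypothetical sample data: the field names are an assumption,
# not the format returned by OAG.
records = [
    {"doi": "10.1000/a", "journal": "Journal A", "license": "cc-by"},
    {"doi": "10.1000/b", "journal": "Journal B", "license": None},
    {"doi": "10.1000/c", "journal": "Journal B", "license": None},
    {"doi": "10.1000/d", "journal": "Journal C", "license": None},
]

print("Success rate: {:.0%}".format(success_rate(records)))
for journal, missing in missing_by_journal(records):
    print("{}: {} article(s) without a license".format(journal, missing))
```

The journals at the top of the second listing are the candidates for gathering license statements (step 3) or, where the license only appears in the PDF, for a plugin (step 4).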