CottageLabs/OpenArticleGauge

AIP - Copyright statement only in PDF

Opened this issue · 3 comments

Noting for the future. There is an OA tag on the landing page but nothing that gives license information until you hit the PDF itself.

Something for further down the track.

An example: http://scitation.aip.org/content/aip/journal/jap/114/5/10.1063/1.4817422

Another note for the future:
URL to pdf: http://scitation.aip.org/deliver/fulltext/aip/journal/jap/114/5/1.4817422.pdf?itemId=/content/aip/journal/jap/114/5/10.1063/1.4817422&mimeType=pdf&containerItemId=content/aip/journal/jap

Nothing out of the ordinary; the PDF URL can be generated from the landing page URL by a plugin specific to AIP.
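A rough sketch of how such a plugin might derive the PDF URL, based only on the single landing page / PDF pair quoted above. The function name `aip_pdf_url` and the assumed path layout (`content/<publisher>/journal/<journal>/<volume>/<issue>/<DOI prefix>/<DOI suffix>`) are guesses from that one example, not a documented AIP URL scheme, so other articles may not follow it.

```python
from urllib.parse import urlsplit


def aip_pdf_url(landing_url):
    """Guess the AIP full-text PDF URL from a landing page URL.

    ASSUMPTION: the path layout is inferred from one example in this
    thread; it is not a documented AIP scheme.
    """
    parts = urlsplit(landing_url)
    segments = parts.path.strip("/").split("/")
    # Expected: content/aip/journal/jap/114/5/10.1063/1.4817422
    if len(segments) != 8 or segments[0] != "content":
        raise ValueError("unexpected AIP landing page path: %r" % parts.path)

    _, publisher, kind, journal, volume, issue, _doi_prefix, doi_suffix = segments
    item_id = "/" + "/".join(segments)      # full landing page path
    container = "/".join(segments[:4])      # e.g. content/aip/journal/jap
    fulltext = "/".join([publisher, kind, journal, volume, issue, doi_suffix])

    return (
        f"{parts.scheme}://{parts.netloc}/deliver/fulltext/{fulltext}.pdf"
        f"?itemId={item_id}&mimeType=pdf&containerItemId={container}"
    )
```

Called with the example landing page above, this reproduces the PDF URL quoted in this comment; anything that doesn't match the assumed layout raises `ValueError` rather than producing a wrong URL.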


Downloading the whole PDF could be problematic: we have a lot of bandwidth now, but memory consumption could also be an issue. Still, it should in principle work if the license string is present in the file, though it will be very brittle. To deal with the size, we could split incoming files into chunks (regardless of whether they're PDFs or not), e.g. of 1 MB each, and run all the needed comparisons on each chunk. If nothing is found, move on to the next chunk, and so on.
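The chunking idea could look roughly like the sketch below. The names `read_chunks` and `find_license_in_chunks` are made up for illustration and are not part of the existing code base. One detail worth noting: a license string can straddle a chunk boundary, so the sketch carries over the last few bytes of each chunk into the next comparison. It also assumes the license text appears as plain bytes in the file; PDF content streams are often compressed, which is part of why this would be brittle.

```python
import io


def read_chunks(fileobj, chunk_size=1024 * 1024):
    """Yield successive chunks (default 1 MB) from a binary file-like object."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


def find_license_in_chunks(chunks, needles):
    """Return the first needle (bytes) found in a stream of chunks, or None.

    Only one chunk plus a small overlap is held in memory at a time.
    """
    needles = [n for n in needles if n]
    if not needles:
        return None
    # Keep enough trailing bytes to catch a needle split across two chunks.
    overlap = max(len(n) for n in needles) - 1
    tail = b""
    for chunk in chunks:
        window = tail + chunk
        for needle in needles:
            if needle in window:
                return needle
        tail = window[-overlap:] if overlap else b""
    return None
```

Usage would be something like `find_license_in_chunks(read_chunks(response_stream), [b"Creative Commons Attribution"])`, where `response_stream` stands for whatever file-like object the HTTP layer provides.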

MDPI is another publisher that does this: http://www.mdpi.com/2071-1050/5/7/3095 and the relevant pdf is: http://www.mdpi.com/2071-1050/5/7/3095/pdf

Note for future readers: we don't download PDFs anymore, so this would have to change if we eventually want to support license statements in PDFs. We use the robust python-magic library to check the file header, so it's pretty unlikely a PDF will slip by.
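To illustrate what "checking the file header" means here, a simplified stand-in is sketched below. It is not the project's actual check and does not use python-magic: it only looks for the `%PDF-` signature that PDF files start with. python-magic covers far more formats (for instance via `magic.from_buffer(data, mime=True)`); the function name `looks_like_pdf` is hypothetical.

```python
PDF_SIGNATURE = b"%PDF-"


def looks_like_pdf(header_bytes):
    """Return True if the first bytes of a response carry the PDF signature.

    Simplified stand-in for a python-magic check; only the leading
    bytes of the download are needed, not the whole file.
    """
    return header_bytes.startswith(PDF_SIGNATURE)
```

With a check like this, only the first few bytes of a response need to be read before deciding whether to skip it.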