This project takes publication data from a single year and determines the open access status of each article, using the various sources available. It uses peer-reviewed journal articles registered in CRIS systems as input. It was created in a collaboration between Utrecht University Library and Leiden University Library. Specifics for Utrecht and Leiden can be found in the Wiki.
The program runs in R through RStudio. Ensure you have both installed:
- Install R via cran.rstudio.com
- Install RStudio via rstudio.com
The following packages are required:
- tidyverse
- stringr
- readxl
- jsonlite
- httr
- magrittr
- lubridate
- docstring
- testthat
You can install them in RStudio by running the following in the console:
install.packages("tidyverse")
install.packages("stringr")
install.packages("readxl")
install.packages("jsonlite")
install.packages("httr")
install.packages("magrittr")
install.packages("lubridate")
install.packages("docstring")
install.packages("testthat")
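As an alternative to the separate calls above, the following sketch installs only the packages that are still missing. The helper name `missing_packages` is hypothetical and not part of this project; the install step is guarded by `interactive()` so that it only runs from a console session.

```r
# Packages listed in this README
required <- c("tidyverse", "stringr", "readxl", "jsonlite", "httr",
              "magrittr", "lubridate", "docstring", "testthat")

# Return the packages from `required` that are not yet installed.
# (`missing_packages` is a hypothetical helper name.)
missing_packages <- function(required,
                             installed = rownames(installed.packages())) {
  setdiff(required, installed)
}

to_install <- missing_packages(required)

# Only install when run interactively, e.g. from the RStudio console
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```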
- Download this project through this zipfile.
- Unzip the download.
- Open `OAmonitoring.rproj`, which will launch RStudio.
Multiple data files can be processed together in this pipeline. Formats can be `.xls`, `.xlsx`, `.tsv`, and `.csv`.
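To illustrate how files in these formats could be read into R, here is a minimal sketch that picks a reader based on the file extension. The function `read_input_file` is a hypothetical helper and not the pipeline's own reading code, which may work differently.

```r
# Read one input file, choosing the reader by file extension.
# (`read_input_file` is a hypothetical helper, not a function from this project.)
read_input_file <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    xls  = ,
    xlsx = as.data.frame(readxl::read_excel(path)),
    tsv  = read.delim(path, stringsAsFactors = FALSE, check.names = FALSE),
    csv  = read.csv(path, stringsAsFactors = FALSE, check.names = FALSE),
    stop("Unsupported file format: ", ext)
  )
}

# Example usage, assuming the input files are in the folder "data":
# files  <- list.files("data", pattern = "\\.(xls|xlsx|tsv|csv)$", full.names = TRUE)
# inputs <- lapply(files, read_input_file)
```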
Your data will need to have the following columns to successfully make it through the pipeline:
- Journal ISSN
- paper DOI
- A unique ID (given by your CRIS)
- Organization unit (e.g. faculty, department, etc)
NB: The columns may have missing data, but must exist. The exception to this is the unique internal ID, which must be complete.
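Before running the pipeline, you may want to check these requirements yourself. The sketch below is one way to do so; `check_input` and the column names in the usage example are hypothetical and should be replaced with the headers in your own data.

```r
# Check that all required columns exist and that the unique ID is complete.
# (`check_input` is a hypothetical helper, not part of this project.)
check_input <- function(df, issn_col, doi_col, id_col, org_col) {
  required <- c(issn_col, doi_col, id_col, org_col)
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  # Other columns may contain missing data, but the ID column may not
  ids <- df[[id_col]]
  if (any(is.na(ids) | trimws(as.character(ids)) == "")) {
    stop("The unique ID column '", id_col, "' contains missing values.")
  }
  invisible(TRUE)
}

# Example usage with made-up column names:
# check_input(my_data, issn_col = "ISSN", doi_col = "DOI",
#             id_col = "Pure ID", org_col = "Faculty")
```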
- Place your input file(s) in the folder `data`.
- Open the file `config_pub_files.xlsx` (in the folder `config`) in Excel.
- For each input file, fill out the grid:
  - Place the file name of an input file in the field 'filename'.
  - Place the header of each column that contains the data needed in the fields below.
  - If there is only a single column with ISSN or EISSN, ensure it is placed in the ISSN field.
  - Indicate in the last field what additional columns you want to retain in the output file. If you want to keep all columns, put 'all'; otherwise make a selection of columns, and separate them with a comma and a space (e.g. `Publisher, Authors`).
- NB: Don't forget to remove the examples that are currently present in the file! This configuration file may contain data only in the columns that describe your own input data, and must otherwise be empty.
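The format of the 'columns to retain' field can be illustrated with a short sketch that splits the value into column names. The helper `parse_retained_columns` is hypothetical, and treating an empty field as "retain nothing extra" is an assumption; the pipeline's own parsing may differ.

```r
# Turn the 'columns to retain' field into a vector of column names.
# (`parse_retained_columns` is a hypothetical helper, not part of this project.)
parse_retained_columns <- function(field, available) {
  # Assumption: an empty field means no additional columns are kept
  if (is.na(field) || trimws(field) == "") return(character(0))
  if (trimws(field) == "all") return(available)
  cols <- trimws(strsplit(field, ",", fixed = TRUE)[[1]])
  unknown <- setdiff(cols, available)
  if (length(unknown) > 0) {
    warning("Column(s) not found in the input file: ",
            paste(unknown, collapse = ", "))
  }
  intersect(cols, available)
}

# Example usage:
# parse_retained_columns("Publisher, Authors", names(my_data))
```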
The VSNU list is an Excel sheet with the DOIs of papers published under the OA publishing deals through the VSNU. It is available on request (contact b.m.i.vreede@uu.nl if you need this list).
- Place the file in the folder `data`.
- Open the file `config.R` (in the folder `config`). You can do this directly in RStudio.
- Ensure the location and the correct name of the file are assigned to the variable `path_vsnu`.
- Change the email address in `config.R` to your own.
- If you are running this for the first time: make sure the variables `use_doaj` and `use_upw` are set to `api`.
- Once the script has run successfully, and files have been generated with the results from these APIs (they will show up in the folder `data/clean`), you can edit the paths under `path_doaj` and `path_upw` to reflect the location and name of these saved files. The variables `use_doaj` and `use_upw` can now be set to `saved`.
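As a rough illustration, the relevant settings in `config.R` might look like the sketch below. The variable names `path_vsnu`, `use_doaj`, `use_upw`, `path_doaj`, and `path_upw` come from the steps above; the name `email` and all file names are placeholders chosen for this example, so check the actual `config.R` for the names it uses.

```r
# Sketch of the settings discussed above. The `email` variable name and all
# file names are placeholders, not taken from the project's config.R.
email <- "your.name@example.org"

# Location of the VSNU sheet placed in the folder "data" (hypothetical file name)
path_vsnu <- "data/vsnu_dois.xlsx"

# First run: query the APIs
use_doaj <- "api"
use_upw  <- "api"

# Later runs: reuse the results saved in data/clean (hypothetical file names)
# path_doaj <- "data/clean/doaj_results.csv"
# path_upw  <- "data/clean/upw_results.csv"
# use_doaj  <- "saved"
# use_upw   <- "saved"
```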
Publication output must be reported for each HOOP area. HOOP areas divide Dutch higher education and research into eight categories: Landbouw (Agriculture), Natuur (Nature), Techniek (Technology), Gezondheid (Health), Gedrag en Maatschappij (Behaviour and Society), Economie (Economics), Recht (Law), and Taal en Cultuur (Language and Culture). Each faculty or department is assigned to one HOOP category.
You can edit the file `HOOP_areas.xlsx`, in the folder `config`, to contain the names of organization units (e.g. departments or faculties) that are classified under each HOOP area.
Ensure all names are spelled identically to their appearance in your data.
(Don't forget to remove the mock data entries here.)
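Because the names must match exactly, it can help to compare the organization units in your configuration with those in your data before running the pipeline. The sketch below uses a hypothetical helper, `compare_units`, that takes two character vectors; how you extract those vectors from your own files is up to you.

```r
# Report organization units that appear in only one of the two sources.
# (`compare_units` is a hypothetical helper, not part of this project.)
compare_units <- function(config_units, data_units) {
  config_units <- unique(config_units[!is.na(config_units)])
  data_units   <- unique(data_units[!is.na(data_units)])
  list(
    only_in_config = setdiff(config_units, data_units),
    only_in_data   = setdiff(data_units, config_units)
  )
}

# Example usage with made-up names; a spelling difference shows up in both lists:
# compare_units(c("Departement Biologie", "Departement Scheikunde"),
#               c("Departement Biologie", "Department Scheikunde"))
```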
Edit the file `reports.xlsx`, in the folder `config`, with any custom reports you may want to generate.
The title of the report is up to you, but the organization units included must contain names spelled identically to the organization units in your data.
For example: a report for the Faculty of Science would look like this:
| Title of the report | Faculty of Science |
|---|---|
| Organization units included | Departement Biologie |
| | Departement Scheikunde |
| | Departement Natuurkunde |
| | Departement Informatica |
All reports can be written up in this file, simply by adding more columns.
Open `reporting_pipeline.Rmd`, and run the code.
You can run this item by item by clicking on the green triangle at the top-right corner of every "chunk", or run all at once by clicking the down-arrow next to the `Run` button and selecting `Run All`.
Results of your run will appear in the folder `output`.
The pipeline harvests OA information from the following sources:
- Directory of Open Access Journals (DOAJ)
- Unpaywall
- Use of VSNU Open Access publishing deals (an Excel sheet available through the VSNU)
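To give an idea of what one of these lookups can involve, the sketch below queries the Unpaywall REST API for a single DOI using `httr` and `jsonlite`. The function names are hypothetical and the pipeline's own request code may differ; the endpoint `https://api.unpaywall.org/v2/` and the `oa_status` response field belong to Unpaywall's public API, which requires an email address as a query parameter.

```r
# Build the request URL for one DOI (Unpaywall requires an email parameter).
# (`unpaywall_url` and `fetch_oa_status` are hypothetical helper names.)
unpaywall_url <- function(doi, email) {
  paste0("https://api.unpaywall.org/v2/", utils::URLencode(doi),
         "?email=", utils::URLencode(email, reserved = TRUE))
}

# Fetch the OA status for one DOI; returns NA when the response status is not 200.
fetch_oa_status <- function(doi, email) {
  resp <- httr::GET(unpaywall_url(doi, email))
  if (httr::status_code(resp) != 200) return(NA_character_)
  parsed <- jsonlite::fromJSON(
    httr::content(resp, as = "text", encoding = "UTF-8")
  )
  parsed$oa_status
}

# Example usage (needs an internet connection and your own email address):
# fetch_oa_status("10.1038/nature12373", "your.name@example.org")
```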
Each paper in your data is assigned an OA label based on the OA information above. The checks are applied in sequence:
- Is the journal present in the DOAJ? -> label: GOLD
- Is the DOI present in the VSNU list? -> label: HYBRID
- Is the DOI classified as gold or hybrid by Unpaywall? -> label: HYBRID
- (Optional: is the system ID present in a custom list supplied by the user? -> label: GREEN)
- Is the DOI classified as green by Unpaywall? -> label: GREEN
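The sequence above can be sketched as a small function in which earlier checks take precedence over later ones. This is an illustration, not the pipeline's own code: the function and argument names are hypothetical, and the fallback label `CLOSED` for papers that match none of the checks is an assumption.

```r
# Assign one OA label per paper by applying the checks in the listed order.
# All names are hypothetical; "CLOSED" as the fallback label is an assumption.
assign_oa_label <- function(in_doaj, in_vsnu, upw_status,
                            in_custom_green = rep(FALSE, length(in_doaj))) {
  upw_status <- tolower(upw_status)
  label <- rep("CLOSED", length(in_doaj))
  # Work from the last check to the first, so that earlier checks
  # overwrite later ones and therefore take precedence.
  label[upw_status %in% "green"] <- "GREEN"
  label[in_custom_green %in% TRUE] <- "GREEN"
  label[upw_status %in% c("gold", "hybrid")] <- "HYBRID"
  label[in_vsnu %in% TRUE] <- "HYBRID"
  label[in_doaj %in% TRUE] <- "GOLD"
  label
}

# Example usage with three made-up papers:
# assign_oa_label(in_doaj    = c(TRUE, FALSE, FALSE),
#                 in_vsnu    = c(FALSE, TRUE, FALSE),
#                 upw_status = c("green", "closed", "green"))
```

Using `%in%` rather than `==` keeps missing values from propagating, so a paper without an Unpaywall result simply falls through to the next check.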