ReCiterDB

Summary
Functionality
Technical
- Prerequisites
- Installation
  - Locally
  - On AWS
Components
More on the ReCiter suite of applications

Summary

ReCiterDB is an open-source MariaDB database and set of tools that stores publication lists and computes bibliometric statistics for an academic institution's faculty and other people of interest. ReCiterDB is designed to be populated with person and publication data from ReCiter (a machine learning-driven publication suggestion engine) and from third-party sources such as NIH's iCite and Digital Science's Altmetric services. The data in the system can be viewed using the ReCiter Publication Manager web application, or the database can serve as a standalone reporting database. For more on the functionality in Publication Manager, see that repository.

This repository contains:

A MariaDB (a MySQL fork) schema for ReCiterDB
Stored procedures and events for populating and updating that database
Python and shell scripts for importing data into ReCiterDB
A Dockerfile that automates deployment of these components

Functionality

In conjunction with data from ReCiter, ReCiterDB has been used to answer questions such as the following:

Senior-authored academic articles in Department of Anesthesiology
Percentage of full-time faculty publications that were indexed in PubMed with an ORCID identifier
Publications by full-time faculty added in the past week
h5 index of full-time faculty
Which active full-time faculty does any given faculty cite most often on their papers?
Which faculty publish the most frequently on cancer, overall and by proportion of their total scholarly output?
What percent of papers published by a given faculty are in collaboration with existing members of the Cancer Center?
What are the most influential cancer-related papers by members of the Cancer Center?
Finally, a variety of person-level bibliometric statistics are available through a bibliometric report that can be generated on demand (see sample)
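
As an illustration of the kind of statistic listed above, here is a minimal Python sketch of an h-index and an h5-style index. It is not the stored procedure that ships with ReCiterDB. The function names, the input shape (pairs of publication year and citation count), and the five-year window rule are assumptions made for this example.

```python
from datetime import date


def h_index(citation_counts):
    """Largest h such that h articles have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def h5_index(articles, today=None, window_years=5):
    """h-index restricted to recently published articles.

    `articles` is an iterable of (publication_year, citation_count) pairs.
    Assumption: "recent" means a publication year later than
    (current year - window_years); ReCiterDB may define the window differently.
    """
    today = today or date.today()
    cutoff = today.year - window_years
    recent = [cites for year, cites in articles if year > cutoff]
    return h_index(recent)
```

For example, calling `h5_index(rows, today=date(2024, 1, 1))` considers only articles published from 2020 onward.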

Technical

Prerequisites

A recent installation of MariaDB. It is important to use MariaDB (a fork of MySQL) rather than MySQL, because the stored procedures that ship with ReCiterDB use several functions that only MariaDB supports.
Populated instance of ReCiter. This is where all the person and publication data live.
Installation of ReCiter Publication Manager (optional). Needed in case you wish to interact with the data (curate, report on, etc.) in ReCiterDB through a web user interface.

Installation

Locally

Download repository to local directory.
Unzip and move to desired directory.
Ensure that both setupReciterDB.py and retrieveUpdate.sh are executable. You can do so in Terminal by navigating to the directory where these files are located and running the following commands:

chmod +x setupReciterDB.py
chmod +x retrieveUpdate.sh

Create the ReCiterDB database and a user with administrative privileges. Generally speaking, you need a user with broad privileges. This user will be creating and updating tables, views, and stored procedures. (That said, the SUPER, FILE, and SHUTDOWN privileges are not needed.) From the MySQL prompt, run the following:

CREATE DATABASE IF NOT EXISTS `reciterDB` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'admin' IDENTIFIED BY 'insert password';
GRANT ALL PRIVILEGES ON *.* TO `admin`;

Set the following environment variables in your Terminal window (Mac). The first four come from the prior step. The final three come from your existing installation of ReCiter.

export DB_HOST=[db host]
export DB_USERNAME=[user]
export DB_PASSWORD=[password]
export DB_NAME=[db name]
export AWS_ACCESS_KEY_ID=[access key ID]
export AWS_SECRET_ACCESS_KEY=[secret access key]
export AWS_DEFAULT_REGION=[region]
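
Before running the scripts, it can help to confirm that all seven variables are actually set in the current shell. The following is a minimal sketch of such a check. The helper name `missing_variables` is hypothetical and is not part of ReCiterDB.

```python
import os

# The seven variables listed above.
REQUIRED_VARS = [
    "DB_HOST",
    "DB_USERNAME",
    "DB_PASSWORD",
    "DB_NAME",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
]


def missing_variables(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```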

Run python3 setupReciterDB.py. This will set up the database schema, the stored procedures, and events, and populate some baseline data. This script should execute in seconds.
To update ReCiterDB on a daily basis, run ./retrieveUpdate.sh. This is a shell script, so it should not be invoked with python3. If you have 20,000 person records, this script may take ~45 minutes to execute.

On AWS

All of the above are packaged in a Dockerfile.

...to provide...

Components

ReCiterDB consists of the following components:

Setup

File name	Expected frequency	Type	Purpose
setupReciterDB.py	At initial setup	Python script	Runs the three SQL files below, which create the database, insert baseline data, and create events and stored procedures.
createDatabaseTableReciterDb.sql	At initial setup	Database schema	Creates ReCiterDb database and the following tables: • `admin_` - tracks users, their roles, and their feedback in Publication Manager • `analysis_altmetric_` - bibliometric article-level data from Altmetric API • `analysis_override_author_position` - a table for manually overriding the inferred author position; there is no way to update these values through the web user interface • `analysis_nih_` - bibliometric article-level data from NIH's iCite API • `analysis_summary_` - periodically updated, summary-level index tables for articles, authorships, and people; the people included in the analysis_summary_person table reflect the list contained in the `analysis_summary_person_scope` table, which is maintained by the system admin; these tables are widely used • `analysis_special_characters` - includes special character to RTF lookups used for generating RTF files • `analysis_temp_` - temporary tables used for staging data so that they can be used for outputting files • `journal_` - metadata about journals from third-party sources • `person_*` - data imported directly from ReCiter's Feature Generator API
insertBaselineDataReciterDb.sql	At initial setup	Data to be imported	Imports following data into existing tables: • roles for Publication Manager application • special characters and their RTF equivalents • Scimago journal rankings • National Library of Medicine (NLM) journals in PubMed
createEventsProceduresReciterDb.sql	At initial setup	Stored procedures & events	Creates stored procedures that are used to: • populate the `analysis_summary_*` tables, which function as a performant index and are useful for querying • generate RTF files. Also creates events that execute certain stored procedures on a nightly basis.

Update

File name	Expected frequency	Type	Purpose
retrieveUpdate.sh	Daily	Shell script	Orchestrates the execution of the five scripts below. This script is expected to run nightly to refresh reporting and bibliometric data.
retrieveS3.py	Daily	Python script	Retrieves article and person data from the AWS S3 storage used by your ReCiter installation.
retrieveDynamoDb.py	Daily	Python script	Retrieves article data from the AWS DynamoDB instance used by your ReCiter installation.
retrieveNIH.py	Daily	Python script	Retrieves list of PMIDs from ReCiterDB and looks up article-level statistics from NIH's iCite RCR service. These statistics are used to generate bibliometrics.
retrieveAltmetric.py	Daily	Python script	Retrieves list of PMIDs from ReCiterDB and looks up article-level statistics from Digital Science's Altmetric service. As of Fall 2022, this requires an API key, which in turn requires submitting your research use case and getting it approved.
updateReciterDB.py	Daily	Python script	Takes data generated from retrieveS3.py and retrieveDynamoDb.py scripts and loads them into ReCiterDB
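
The PMID lookups described for retrieveNIH.py involve sending many identifiers to a remote service. As a rough sketch of one way to batch such requests, the code below builds request URLs for iCite's public `pubs` endpoint. It is not the code in retrieveNIH.py. The helper names and the batch size of 200 are assumptions, and no network call is made here.

```python
from urllib.parse import urlencode

# Public iCite endpoint that accepts a comma-separated list of PMIDs.
ICITE_ENDPOINT = "https://icite.od.nih.gov/api/pubs"


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def icite_urls(pmids, batch_size=200):
    """Build one request URL per batch of de-duplicated, sorted PMIDs.

    Assumption: batch_size=200 is an illustrative value, not a limit
    taken from ReCiterDB or from the iCite documentation.
    """
    unique = sorted({int(p) for p in pmids})
    return [
        ICITE_ENDPOINT
        + "?"
        + urlencode({"pmids": ",".join(map(str, batch))}, safe=",")
        for batch in chunked(unique, batch_size)
    ]
```

Each URL can then be fetched with the HTTP client of your choice and the JSON response loaded into the `analysis_nih_` tables.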

Configuration

Define scope of bibliometrics. As an administrator, you have control over the people for whom the system calculates person-level bibliometrics. This allows for download of a person's bibliometric analysis complete with comparisons to institutional peers. To do this, update the populateAnalysisSummaryPersonScopeTable stored procedure which populates the analysis_summary_person_scope table. Here at Weill Cornell Medicine, we consider only full-time employed faculty (i.e., person_person_type.personType = academic-faculty-weillfulltime).
Import additional journal-level metrics (optional). ReCiterDB ships with journal impact data from Scimago Journal Rank. If you have another journal-level impact metric that uses ISSN as a primary key, it can be imported into the journal_impact_alternative table.
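
Because the alternative metric is keyed on ISSN, it may be worth normalizing and validating ISSNs before loading them. The sketch below checks the standard ISSN check digit and separates valid rows from rejected ones. The function names and the (issn, metric) row shape are assumptions. The actual columns of journal_impact_alternative are not shown here, so map the output to them yourself.

```python
import re

# Eight characters, optional hyphen, last character is a digit or X.
_ISSN_RE = re.compile(r"^(\d{4})-?(\d{3})([\dXx])$")


def normalize_issn(raw):
    """Return the ISSN as 'NNNN-NNNC' if its check digit is valid, else None."""
    match = _ISSN_RE.match(raw.strip())
    if not match:
        return None
    digits = match.group(1) + match.group(2)
    check = match.group(3).upper()
    # Weights 8 down to 2 apply to the first seven digits.
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    expected = (11 - total % 11) % 11
    expected_char = "X" if expected == 10 else str(expected)
    if check != expected_char:
        return None
    return f"{digits[:4]}-{digits[4:]}{check}"


def prepare_rows(records):
    """Split (issn, metric) pairs into importable rows and rejected ISSNs."""
    valid, rejected = [], []
    for issn, metric in records:
        normalized = normalize_issn(issn)
        if normalized is None:
            rejected.append(issn)
        else:
            valid.append((normalized, float(metric)))
    return valid, rejected
```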

More on the ReCiter suite of applications

As the figure describes, the ReCiter suite of applications can fully manage many key steps in institutional publication management.

The key tools and repositories used to perform these steps are:

Repository	Required?	Functionalities
ReCiter	yes	• Store identity info (see #1 above) • Coordinate retrieval of articles from PubMed and optionally Scopus • Use machine learning to estimate the likelihood a scholar wrote each article (#3) • Store a person's identity and articles • Share data through web services (#4, #5)
ReCiter PubMed Retrieval Tool	yes	• Retrieve and normalize publication data from PubMed (#2)
ReCiter Scopus Retrieval Tool	no	• Retrieve and normalize publication data from Scopus (#2)
ReCiter Publication Manager	no	• Collect feedback from librarians and department staff on the articles a given researcher has most likely authored (#4) • Provide a web interface for generating reports (#6)
ReCiterDB	optional but would be needed for Publication Manager	• A set of scripts for retrieving data from ReCiter and populating the database (#5) • A relational database for storing publication and bibliometric data (#6)
