ReCiterDB is an open-source MariaDB database and set of tools that stores publication lists and computes bibliometric statistics for an academic institution's faculty and other people of interest. ReCiterDB is designed to be populated with person and publication data from ReCiter (a machine learning-driven publication suggestion engine) and from third-party sources such as NIH's iCite and Digital Science's Altmetric services. The data in the system can be viewed using the ReCiter Publication Manager web application, or the database can serve as a standalone reporting database. For more on the functionality in Publication Manager, see that repository.
This repository contains:
- A MariaDB (a MySQL fork) schema for ReCiterDB
- Stored procedures and events for populating and updating that database
- Python and shell scripts for importing data into ReCiterDB
- A Docker file that automates deployment of these components
In conjunction with data from ReCiter, ReCiterDB has been used to answer questions such as the following:
- Senior-authored academic articles in Department of Anesthesiology
- Percentage of full-time faculty publications that were indexed in PubMed with an ORCID identifier
- Publications by full-time faculty added in the past week
- h5 index of full-time faculty
- Which active full-time faculty does any given faculty member cite most often in their papers?
- Which faculty publish the most frequently on cancer, overall and by proportion of their total scholarly output?
- What percent of papers published by a given faculty member are in collaboration with existing members of the Cancer Center?
- What are the most influential cancer-related papers by members of the Cancer Center?
- Finally, a variety of person-level bibliometric statistics are available through a bibliometric report that can be generated on demand (see sample)
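To make one of these metrics concrete, here is a minimal sketch of an h5-index calculation. It assumes the common definition (the h-index computed only over papers published in the last five complete calendar years); the function names and the `(year, citations)` input shape are hypothetical and are not taken from ReCiterDB's stored procedures, which may define the window differently.

```python
from datetime import date


def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank
        else:
            break
    return h


def h5_index(papers, current_year=None):
    """h-index restricted to papers from the last five complete calendar years.

    `papers` is an iterable of (publication_year, citation_count) pairs.
    The five-year window is an assumption, not ReCiterDB's confirmed definition.
    """
    if current_year is None:
        current_year = date.today().year
    window = range(current_year - 5, current_year)
    return h_index(count for year, count in papers if year in window)
```

For example, `h5_index([(2015, 100), (2015, 100), (2022, 1)], current_year=2024)` counts only the 2022 paper, because the two highly cited 2015 papers fall outside the window.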
- Installed recent version of MariaDB. It's important to use MariaDB (a fork of MySQL) as opposed to MySQL because the stored procedures that ship with ReCiterDB include several functions that are uniquely supported by MariaDB.
- Populated instance of ReCiter. This is where all the person and publication data live.
- Installation of ReCiter Publication Manager (optional). Needed in case you wish to interact with the data (curate, report on, etc.) in ReCiterDB through a web user interface.
- Download repository to local directory.
- Unzip and move to desired directory.
- Ensure both the `setupReciterDB.py` and `retrieveUpdate.sh` scripts are executable. You can do so in Terminal by navigating to the directory where these files are located and running the following commands:
chmod +x setupReciterDB.py
chmod +x retrieveUpdate.sh
- Create the ReCiterDB database and a user with administrative privileges. This user needs broad privileges because it will be creating and updating tables, views, and stored procedures. (That said, the `SUPER`, `FILE`, and `SHUTDOWN` privileges are not needed.) From the MariaDB prompt, run the following:
CREATE DATABASE IF NOT EXISTS `reciterDB` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'admin' IDENTIFIED BY 'insert password';
GRANT ALL PRIVILEGES ON *.* TO `admin`;
- Set the following environment variables in your Terminal window (Mac). The first four come from the prior step. The final three come from your existing installation of ReCiter.
export DB_HOST=[db host]
export DB_USERNAME=[user]
export DB_PASSWORD=[password]
export DB_NAME=[db name]
export AWS_ACCESS_KEY_ID=[access key ID]
export AWS_SECRET_ACCESS_KEY=[secret access key]
export AWS_DEFAULT_REGION=[region]
- Run `python3 setupReciterDB.py`. This will set up the database schema, stored procedures, and events, and populate some baseline data. This script should execute in seconds.
- To update ReCiterDB on a daily basis, run `./retrieveUpdate.sh`. If you have 20,000 person records, this script may take ~45 minutes to execute.
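Before running either script, it can help to confirm that the seven environment variables listed in the steps above are actually set in the current shell. The following is a minimal sketch of such a preflight check. The helper name `missing_env_vars` is hypothetical and is not part of this repository, and the sketch assumes the scripts read these values from the process environment.

```python
import os

# Variable names copied from the export commands in the installation steps.
REQUIRED_VARS = (
    "DB_HOST",
    "DB_USERNAME",
    "DB_PASSWORD",
    "DB_NAME",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to the real process environment; passing a dict
    makes the check easy to exercise without touching the shell.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env_vars()` with no arguments inspects the current environment; an empty list means all seven variables are present and non-empty.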
All of the above are packaged in a Docker file.
ReCiterDB consists of the following components:
File name | Expected frequency | Type | Purpose |
---|---|---|---|
setupReciterDB.py | At initial setup | Python script | Runs the three SQL files below, which create the database schema, insert baseline data, and create the events and stored procedures. |
createDatabaseTableReciterDb.sql | At initial setup | Database schema | Creates ReCiterDb database and the following tables: • admin_* - tracks users, their roles, and their feedback in Publication Manager • analysis_altmetric_* - bibliometric article-level data from Altmetric API • analysis_override_author_position - a table for manually overriding the inferred author position; there is no way to update these values through the web user interface • analysis_nih_* - bibliometric article-level data from NIH's iCite API • analysis_summary_* - periodically updated, summary-level index tables for articles, authorships, and people; the people included in the analysis_summary_person table reflect the list contained in the analysis_summary_person_scope table, which is maintained by the system admin; these tables are widely used • analysis_special_characters - includes special character to RTF lookups used for generating RTF files • analysis_temp_* - temporary tables used for staging data so that they can be used for outputting files • journal_* - metadata about journals from third-party sources • person_* - data imported directly from ReCiter's Feature Generator API |
insertBaselineDataReciterDb.sql | At initial setup | Data to be imported | Imports the following data into existing tables: • roles for Publication Manager application • special characters and their RTF equivalents • Scimago journal rankings • National Library of Medicine (NLM) journals in PubMed |
createEventsProceduresReciterDb.sql | At initial setup | Stored procedures & events | Creates stored procedures that are used to: • populate the analysis_summary_* tables, which function as a performant index and are useful for querying • generate RTF files. Also creates events that execute certain stored procedures on a nightly basis. |
File name | Expected frequency | Type | Purpose |
---|---|---|---|
retrieveUpdate.sh | Daily | Shell script | Orchestrates the execution of the five scripts below. This script is expected to run nightly to refresh reporting and bibliometric data. |
retrieveS3.py | Daily | Python script | Retrieves article and person data from the AWS S3 bucket used by your ReCiter installation. |
retrieveDynamoDb.py | Daily | Python script | Retrieves article data from the AWS DynamoDB instance used by your ReCiter installation. |
retrieveNIH.py | Daily | Python script | Retrieves list of PMIDs from ReCiterDB and looks up article-level statistics from NIH's iCite RCR service. These statistics are used to generate bibliometrics. |
retrieveAltmetric.py | Daily | Python script | Retrieves list of PMIDs from ReCiterDB and looks up article-level statistics from Digital Science's Altmetric service. As of Fall 2022, this requires an API key, which in turn requires providing and getting your research use case approved. |
updateReciterDB.py | Daily | Python script | Takes the data generated by the retrieveS3.py and retrieveDynamoDb.py scripts and loads it into ReCiterDB. |
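Scripts such as retrieveNIH.py and retrieveAltmetric.py start from a list of PMIDs and look each one up in an external service. One common way to handle a long list is to deduplicate it and split it into fixed-size batches before making requests. The sketch below illustrates that pattern only: the helper names and the default batch size of 200 are hypothetical, and this is not the code the scripts in this repository actually use.

```python
from itertools import islice


def deduplicate(pmids):
    """Drop repeated PMIDs while keeping first-seen order."""
    return list(dict.fromkeys(pmids))


def batched(pmids, batch_size=200):
    """Yield lists of at most `batch_size` PMIDs, preserving input order.

    The default of 200 is an arbitrary illustration; check the limits
    documented by whichever service is being queried.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(pmids)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Each batch can then be sent as one request, which keeps the number of calls proportional to the number of batches rather than the number of articles.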
- Define scope of bibliometrics. As an administrator, you have control over the people for whom the system calculates person-level bibliometrics. This allows for download of a person's bibliometric analysis complete with comparisons to institutional peers. To do this, update the `populateAnalysisSummaryPersonScopeTable` stored procedure, which populates the `analysis_summary_person_scope` table. Here at Weill Cornell Medicine, we consider only full-time employed faculty (i.e., `person_person_type.personType = academic-faculty-weillfulltime`).
- Importing additional journal-level metrics (optional). ReCiterDB ships with journal impact data from Scimago Journal Rank. If you have another journal-level impact metric that uses ISSN as a primary key, it can be imported into the `journal_impact_alternative` table.
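Because an alternative journal metric is keyed on ISSN, it may be worth normalizing and validating ISSNs before import so that rows match on a consistent key. The sketch below applies the standard ISSN check-digit rule (weights 8 down to 2 over the first seven digits, modulo 11, with `X` standing for 10). The function name `normalize_issn` is hypothetical, and the hyphenated `NNNN-NNNC` output format is an assumption about how ISSNs are stored in `journal_impact_alternative`.

```python
import re

# Accepts "1234-5678", "12345678", and a lowercase or uppercase X check digit.
ISSN_PATTERN = re.compile(r"^(\d{4})-?(\d{3})([\dXx])$")


def normalize_issn(raw):
    """Return the ISSN as 'NNNN-NNNC' if its check digit is valid, else None."""
    match = ISSN_PATTERN.match(raw.strip())
    if not match:
        return None
    digits = match.group(1) + match.group(2)
    check = match.group(3).upper()
    # Weights run from 8 down to 2 across the first seven digits.
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    expected = (11 - total % 11) % 11
    expected_char = "X" if expected == 10 else str(expected)
    if check != expected_char:
        return None
    return f"{digits[:4]}-{digits[4:]}{check}"
```

Rows for which `normalize_issn` returns `None` can be set aside for manual review rather than imported.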
As the figure describes, the ReCiter suite of applications can fully manage many key steps in institutional publication management.
The key tools and repositories used to perform these steps are:
Repository | Required? | Functionalities |
---|---|---|
ReCiter | yes | • Store identity info (see #1 above) • Coordinate retrieval of articles from PubMed and optionally Scopus • Use machine learning to estimate the likelihood a scholar wrote each article (#3) • Store a person's identity and articles • Share data through web services (#4, #5) |
ReCiter PubMed Retrieval Tool | yes | • Retrieve and normalize publication data from PubMed (#2) |
ReCiter Scopus Retrieval Tool | no | • Retrieve and normalize publication data from Scopus (#2) |
ReCiter Publication Manager | no | • Collect feedback from librarians and department staff on the articles a given researcher has most likely authored (#4) • Provide a web interface for generating reports (#6) |
ReCiterDB | optional but would be needed for Publication Manager | • A set of scripts for retrieving data from ReCiter and populating the database (#5) • A relational database for storing publication and bibliometric data (#6) |