This repository accompanies the paper "A Content-Based Novelty Measure for Scholarly Publications: A Proof of Concept." NovEval is a tool that automatically evaluates the novelty of an academic manuscript; its scores have been shown to significantly correlate with the judgments of human experts. NovEval estimates how rarely a sequence of words (i.e., the manuscript) occurs in the universe of scholarly discourse, using a GPT-2 model trained solely on English Wikipedia. See the paper for details.
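The core quantity behind this kind of rarity estimate is perplexity: a language model trained only on Wikipedia assigns low probability (hence high perplexity) to phrasing it has rarely seen. The repository's actual scoring code is not reproduced here; below is a minimal, self-contained sketch of the perplexity computation itself, with made-up token probabilities standing in for real model outputs.

```python
import math

def perplexity(log_probs):
    """Perplexity of a token sequence, given per-token natural-log
    probabilities: exp(-mean log p). Higher = the model finds the
    text more surprising, i.e., rarer in its training corpus."""
    if not log_probs:
        raise ValueError("empty sequence")
    return math.exp(-sum(log_probs) / len(log_probs))

# Hypothetical scores: a predictable sequence vs. a surprising one.
familiar = [math.log(0.5)] * 10   # each token had probability 0.5
novel = [math.log(0.05)] * 10     # each token had probability 0.05

print(perplexity(familiar))  # 2.0
print(perplexity(novel))     # 20.0
```

In practice the per-token log probabilities come from running the trained GPT-2 model over the manuscript; the arithmetic above is all that turns them into a single novelty score.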
The repository contains scripts to reproduce the GPT-2 model and the reported experiments.
Play with it now at NovEval.
The trained GPT-2 model is hosted at Zenodo.
To reproduce the model, run the following:
# loaded modules on IU's Big Red 200 cluster:
# 1) craype-x86-rome 5) gcc/11.2.0 9) cray-libsci/21.08.1.2 13) nano/6.4
# 2) libfabric/1.11.0.3.71 6) craype/2.7.14 10) PrgEnv-gnu/8.3.2 14) cudatoolkit/11.7
# 3) craype-network-ofi 7) cray-dsmml/0.2.2 11) xalt/2.10.34 15) python/3.10.5
# 4) perftools-base/21.12.0 8) cray-mpich/8.1.14 12) git/2.34
python3.10 -m venv .venv
source .venv/bin/activate
python -m pip install -r requirements.txt
torchrun --standalone --nproc_per_node=4 train.py config/train_gpt2_wikipedia_en.py
The model was trained during the update from PyTorch 1.0 to 2.0; the pinned dependencies may therefore conflict with one another.
# for token- and sentence-level sanity check
bash face_validity.sh
# for section-level sanity check
bash known_group_validity.sh
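The known-group check rests on a simple idea: groups of text expected a priori to differ in novelty (e.g., by section type) should receive reliably different scores. The repository's script is not reproduced here; as a sketch of the underlying logic, here is a small permutation test on the difference of group means, using hypothetical perplexity scores.

```python
import random

def permutation_test(a, b, n=10_000, seed=0):
    """Two-sided permutation test: how often does a random relabeling
    of the pooled scores produce a mean difference at least as large
    as the observed one?"""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n):
        rng.shuffle(pooled)
        d = abs(sum(pooled[:len(a)]) / len(a) - sum(pooled[len(a):]) / len(b))
        if d >= observed:
            hits += 1
    return hits / n

# Hypothetical perplexity scores for two groups that should differ.
high_novelty = [22.1, 25.4, 24.8, 23.9, 26.0]
low_novelty = [14.2, 15.1, 13.8, 16.0, 14.9]
print(permutation_test(high_novelty, low_novelty))  # small p-value
```

A small p-value here would indicate that the measure separates the known groups, which is what the section-level sanity check is probing.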
License: 0BSD
@inproceedings{wang2024noveval,
author = {Haining Wang},
title = {A Content-Based Novelty Measure for Scholarly Publications: A Proof of Concept},
booktitle = {Proceedings of iConference 2024: Wisdom, Well-Being, Win-Win},
year = {2024},
doi = {10.1007/978-3-031-57867-0_31},
publisher = {Springer Nature Switzerland},
address = {Cham},
pages = {409--420},
isbn = {978-3-031-57867-0},
}