Deep-CENIC

This repository contains data and code of Deep-CENIC, a deep learning classifier for classifying ideational impact of research papers. In our Decision Support Systems paper: "Classifying the ideational impact of Information Systems review articles: A content-enriched deep learning approach", which provides further details on the data and model, Deep-CENIC was evaluated in the context of information systems (IS) review articles (RAs).

Goals

The repository was developed with reproducibility in mind, offering a starting point to reuse the following components:

Definitions and implementation of NLP features (see table in the following Data section)
Coding of ideational impact as a gold-standard for training and evaluating ML algorithms
Blueprint for a citation analysis repository structure (based on Cookiecutter Data Science)

Data

The dataset contains IS RAs published in major IS journals (Lowry et al. 2013) between 2000 and 2014. A forward search has been conducted on a sample of the original dataset comprising only citing publications, which cite a RA from the IS business value domain and appeared in a journal included in the Senior Scholars' basket of journals. The citation data was extracted from Google Scholar and Web of Science. A large scale coding of the ideational impact of IS RAs was conducted, which serves as the input for the Deep-CENIC model. The data is structured into three stages:

raw
interim
processed

Due to copyright restrictions some data has been removed from the repository or is presented in truncated form:

Raw TEI files: TEI files contain the full text of published papers and have therefore been removed. Refer to GROBID for details on how to generate TEI files from PDF papers.
raw/LR.csv, interim/LR.csv and interim/CP.csv contain paper abstracts which have been truncated.
interim/CITATION.csv contain citation sentences which have been truncated and only a random sample has been retained in the repository.

The final dataset (FEATURE_FRAME.csv) is fully available:

Feature	Description	Data Type	Source
citation_key_lr	Bibtex citation key	String	raw
citation_key_cp	Bibtex citation key	String	raw
focal_citations	Number of citations toward the focal RA	Integer	summarize_citation_df
textual	Number of textual citations	Integer	is_textual_citation
separate	Number of standalone citations	Integer	is_separate
comp_sup	Number of comparative and superlative clauses	Integer	has_comp_sup
prp	Number of personal pronouns	Integer	has_1st_3rd_prp
pos_0	Appearances of POS pattern 0	Integer	find_pos_patterns
pos_1	Appearances of POS pattern 1	Integer	find_pos_patterns
pos_2	Appearances of POS pattern 2	Integer	find_pos_patterns
pos_3	Appearances of POS pattern 3	Integer	find_pos_patterns
pos_4	Appearances of POS pattern 4	Integer	find_pos_patterns
pos_5	Appearances of POS pattern 5	Integer	find_pos_patterns
sentence_popularity amin	Number of different citations in the citing sentence (min aggregate)	Integer	get_popularity
sentence_popularity amax	Number of different citations in the citing sentence (max aggregate)	Integer	get_popularity
sentence_popularity mean	Number of different citations in the citing sentence (mean aggregate)	Double	get_popularity
context_popularity amin	Number of different citations in the citing context (min aggregate)	Integer	get_popularity
context_popularity amax	Number of different citations in the citing context (max aggregate)	Integer	get_popularity
context_popularity mean	Number of different citations in the citing context (mean aggregate)	Double	get_popularity
sentence_density amin	Focal citations divided by total citations in the citing sentence (min aggregate)	Double	get_density
sentence_density amax	Focal citations divided by total citations in the citing sentence (max aggregate)	Double	get_density
sentence_density mean	Focal citations divided by total citations in the citing sentence (mean aggregate)	Double	get_density
context_density amin	Focal citations divided by total citations in the citing context (min aggregate)	Double	get_density
context_density amax	Focal citations divided by total citations in the citing context (max aggregate)	Double	get_density
context_density mean	Focal citations divided by total citations in the citing context (mean aggregate)	Double	get_density
position_in_sentence amin	Position of the reference in citing sentence (min aggregate)	Double	get_position_in_sentence
position_in_sentence amax	Position of the reference in citing sentence (max aggregate)	Double	get_position_in_sentence
position_in_sentence mean	Position of the reference in citing sentence (mean aggregate)	Double	get_position_in_sentence
sentence_neg amin	Negative sentiment of the citing sentence with regards to the RA (min aggregate)	Double	get_sentiment
sentence_neg amax	Negative sentiment of the citing sentence with regards to the RA (max aggregate)	Double	get_sentiment
sentence_neg mean	Negative sentiment of the citing sentence with regards to the RA (mean aggregate)	Double	get_sentiment
sentence_neu amin	Neutral sentiment of the citing sentence with regards to the RA (min aggregate)	Double	get_sentiment
sentence_neu amax	Neutral sentiment of the citing sentence with regards to the RA (max aggregate)	Double	get_sentiment
sentence_neu mean	Neutral sentiment of the citing sentence with regards to the RA (mean aggregate)	Double	get_sentiment
sentence_pos amin	Positive sentiment of the citing sentence with regards to the RA (min aggregate)	Double	get_sentiment
sentence_pos amax	Positive sentiment of the citing sentence with regards to the RA (max aggregate)	Double	get_sentiment
sentence_pos mean	Positive sentiment of the citing sentence with regards to the RA (mean aggregate)	Double	get_sentiment
sentence_compound amin	Compound sentiment of the citing sentence with regards to the RA (min aggregate)	Double	get_sentiment
sentence_compound amax	Compound sentiment of the citing sentence with regards to the RA (max aggregate)	Double	get_sentiment
sentence_compound mean	Compound sentiment of the citing sentence with regards to the RA (mean aggregate)	Double	get_sentiment
context_neg amin	Negative sentiment of the citing context with regards to the RA (min aggregate)	Double	get_sentiment
context_neg amax	Negative sentiment of the citing context with regards to the RA (max aggregate)	Double	get_sentiment
context_neg mean	Negative sentiment of the citing context with regards to the RA (mean aggregate)	Double	get_sentiment
context_neu amin	Neutral sentiment of the citing context with regards to the RA (min aggregate)	Double	get_sentiment
context_neu amax	Neutral sentiment of the citing context with regards to the RA (max aggregate)	Double	get_sentiment
context_neu mean	Neutral sentiment of the citing context with regards to the RA (mean aggregate)	Double	get_sentiment
context_pos amin	Positive sentiment of the citing context with regards to the RA (min aggregate)	Double	get_sentiment
context_pos amax	Positive sentiment of the citing context with regards to the RA (max aggregate)	Double	get_sentiment
context_pos mean	Positive sentiment of the citing context with regards to the RA (mean aggregate)	Double	get_sentiment
context_compound amin	Compound sentiment of the citing context with regards to the RA (min aggregate)	Double	get_sentiment
context_compound amax	Compound sentiment of the citing context with regards to the RA (max aggregate)	Double	get_sentiment
context_compound mean	Compound sentiment of the citing context with regards to the RA (mean aggregate)	Double	get_sentiment
self_citation	At least one author of the RA and CA is identical	Boolean	is_self_citation
title_similarity	Semantic similarity of the RA and CA titles	Double	get_title_similarity
abstract_similarity	Semantic similarity of the RA and CA abstracts	Double	get_abstract_similarity
SYN	Knowledge developed in the cited RA includes synthesis	Boolean	raw
TT	Knowledge developed in the cited RA includes theory testing	Boolean	raw
TB	Knowledge developed in the cited RA includes theory building	Boolean	raw
RG	Knowledge developed in the cited RA includes identification of research gaps	Boolean	raw
CRI	Knowledge developed in the cited RA includes critical assessment	Boolean	raw
RA	Knowledge developed in the cited RA includes development of a research agenda	Boolean	raw
total_references	Total number of references in the CA	Integer	extract_total_references
total_citations	Total number of citations in the CA	Integer	extract_total_citations
weighted_citation_count	RA citations divided by total citations	Double	__main__
mention_positions_10	Number of citations in the first 10% of the paper	Integer	summarize_citation_df
mention_positions_20	Number of citations in the second 10% of the paper	Integer	summarize_citation_df
mention_positions_30	Number of citations in the third 10% of the paper	Integer	summarize_citation_df
mention_positions_40	Number of citations in the forth 10% of the paper	Integer	summarize_citation_df
mention_positions_50	Number of citations in the fifth 10% of the paper	Integer	summarize_citation_df
mention_positions_60	Number of citations in the sixth 10% of the paper	Integer	summarize_citation_df
mention_positions_70	Number of citations in the seventh 10% of the paper	Integer	summarize_citation_df
mention_positions_80	Number of citations in the eigth 10% of the paper	Integer	summarize_citation_df
mention_positions_90	Number of citations in the ninth 10% of the paper	Integer	summarize_citation_df
mention_positions_100	Number of citations in the tenth 10% of the paper	Integer	summarize_citation_df
heading_category_NA	Number of citations in the rest of the paper	Integer	summarize_citation_df
heading_category_intro	Number of citations in the introduction section	Integer	summarize_citation_df
heading_category_background	Number of citations in the background section	Integer	summarize_citation_df
heading_category_theory	Number of citations in the theory section	Integer	summarize_citation_df
heading_category_methods	Number of citations in the methods section	Integer	summarize_citation_df
heading_category_results	Number of citations in the results section	Integer	summarize_citation_df
heading_category_implications	Number of citations in the implications section	Integer	summarize_citation_df
heading_category_appendix	Number of citations in the appendix	Integer	summarize_citation_df
ref_in_title	Citation in the title of the paper	Boolean	check_ref_in_title
ref_in_heading	Number of citations in section headings	Integer	ref_in_heading
ref_in_figure_description	Number of citations in figure captions	Integer	ref_in_figDesc
ref_in_table_description	Number of citations in table captions	Integer	ref_in_tableDesc
USE	Ideational impact target variable	Boolean	raw

Setup

Download and install Docker
Clone this repository: git clone https://github.com/julianprester/deep-cenic.git
Build docker container: make dockerize
Run code: make run

Model

The focus of this repository is on developing and providing an ideational impact dataset. Thus, it does not include the machine and deep learning models trained on the data. For details regarding the standard implementations using the Keras and Tensorflow libraries refer to the paper and the figure below.

References

This repository is part of a broader research program, comprising the following work:

Schryen, G., Wagner, G., & Benlian, A. (2015). Theory of Knowledge for Literature Reviews: An Epistemological Model, Taxonomy and Empirical Analysis of IS Literature. In: Proceedings of the 36th International Conference on Information Systems, Fort Worth, Texas. link.
Wagner, G., Prester, J., Roche, M. P., Benlian, A., & Schryen, G. (2016). Factors Affecting the Scientific Impact of Literature Reviews: A Scientometric Study. In: Proceedings of the 37th International Conference on Information Systems, Dublin, Ireland. link.
Prester, J., & Wagner, G., & Schryen, G. (2018). Classifying the Ideational Impact of IS Review Articles: A Natural Language Processing Based Approach. In: Proceedings of the 39th International Conference on Information Systems, San Francisco, California. link.
Schryen, G., Wagner, G., Benlian, A., & Paré, G. (2020). A Knowledge Development Perspective on Literature Reviews: Validation of a new Typology in the IS Field. Communications of the Association for Information Systems, 46. link.
Schryen, G., Wagner, G., & Benlian, A. (2020). Distinguishing Knowledge Impact from Citation Impact: A Methodology for Analysing Knowledge Impact for the Literature Review Genre. Available at SSRN: link.
Hassan, N. R., Prester, J., & Wagner, G. (2020). Seeking Out Clear And Unique Information Systems Concepts: A Natural Language Processing Approach. In Proceedings of the 28th European Conference on Information Systems, Marrackech, Morocco. link.
Prester, J., & Wagner, G., Schryen, G., & Hassan, N. R. (2020). Classifying the Ideational Impact of Information Systems Review Articles: A Content-enriched Deep Learning Approach. Decision Support Systems, forthcoming. link.