Research Project Repo on How Datasets are Cited @ PURRlab, ITU Copenhagen
Install with `pip3 install -r requirements.txt`.
This project allows you to build datasets of dataset mentions from papers published in https://proceedings.mlr.press/.
Background:
As part of my research project, I analyzed dataset mentions in CHIL 2022. You can refer to my report_logs or see the presentation for a tl;dr.
Code:
- ArticleOrganizer.ipynb : Runs first, selects target venues and downloads their contents locally. Generates `ResearchPapers.csv`.
- ArticleAnalayzer.ipynb : Runs second, requires some configuration to know where in the text to look for research paper mentions. Generates `DatasetMentions_Unprocessed.csv`, a table which may be further annotated to include `Dataset Identifier` and `Access`.
- ArticleVisualizer.ipynb : Run after cleaning and annotating your `unprocessed` file to generate visualizations.
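The run order above can be checked with a small helper. This is only a sketch: the `STAGES` table and the `runnable_notebooks` function are hypothetical names, not part of the repository, and it assumes that each notebook's input lives in `data/` as listed in the Data section and that the visualizer reads `DatasetMentions_Processed.csv`.

```python
from pathlib import Path

# (notebook, file it needs before it can run); None means no prerequisite.
# File names are taken from this README; treating them as hard prerequisites
# inside a single data directory is an assumption of this sketch.
STAGES = [
    ("ArticleOrganizer.ipynb", None),
    ("ArticleAnalayzer.ipynb", "ResearchPapers.csv"),
    ("ArticleVisualizer.ipynb", "DatasetMentions_Processed.csv"),
]


def runnable_notebooks(data_dir):
    """Return the notebooks whose prerequisite file already exists in data_dir."""
    data_dir = Path(data_dir)
    return [
        notebook
        for notebook, required in STAGES
        if required is None or (data_dir / required).exists()
    ]


if __name__ == "__main__":
    # Lists the notebooks that can be run with the files currently in data/.
    print(runnable_notebooks("data"))
```

The manual cleaning and annotation step between the second and third notebook is not something a script can verify, so the helper only checks that the annotated file is present.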
Data:
- data/ResearchPapers.csv : A table of research papers which have been downloaded and their respective venues.
- data/DatasetMentions_Unprocessed.csv : A table of dataset mentions extracted from the downloaded papers. Dataset mentions are sorted by the paper and venue they occur in. The `Mention Style` and `Mention` columns indicate the type of mention and how it occurs in the text. The `Notes` column records the original context so that an annotator may validate and make corrections if necessary.
- data/DatasetMentions_Processed.csv : A manually annotated version of `DatasetMentions_Unprocessed`. Redundant columns were merged and footnote numbers were replaced with the URLs they point to. The example used in this repository introduces the `Dataset Identifier` and `Access` columns for the `ArticleVisualizer.ipynb` visualizer.
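As a quick sanity check on an annotated file before visualizing it, the `Dataset Identifier` and `Access` columns can be summarized with the standard library alone. The sketch below is not taken from the notebooks: the `count_access` function, the sample rows, and the `open` / `restricted` values are made up for illustration, and the real vocabulary of the `Access` column may differ.

```python
import csv
import io
from collections import Counter


def count_access(csv_file):
    """Count distinct datasets per Access value in an annotated mentions table."""
    seen = set()
    counts = Counter()
    for row in csv.DictReader(csv_file):
        dataset = row["Dataset Identifier"].strip()
        access = row["Access"].strip()
        # A dataset mentioned in several papers is counted once per Access value.
        if dataset and (dataset, access) not in seen:
            seen.add((dataset, access))
            counts[access] += 1
    return counts


# Hypothetical rows standing in for data/DatasetMentions_Processed.csv.
SAMPLE = """Dataset Identifier,Access
dataset-a,open
dataset-a,open
dataset-b,restricted
dataset-c,open
"""

if __name__ == "__main__":
    print(count_access(io.StringIO(SAMPLE)))
```

To run it on the real table, pass a file handle opened with `open("data/DatasetMentions_Processed.csv", newline="", encoding="utf-8")` in place of the in-memory sample.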