This is the GitHub repository for the EMNLP 2021 Findings paper, SciCap: Generating Captions for Scientific Figures.
SciCap is a large-scale figure-caption dataset based on computer science arXiv papers published between 2010 and 2020. SciCap contains more than 416k figures of the dominant figure type, the graph plot, extracted from over 290,000 papers.
This dataset aims to allow researchers to study how to use computational models to analyze and caption scientific figures at scale.
SciCap is only for non-commercial use, and is released under CC BY-NC-SA 4.0. By using SciCap, you agree to the terms in the license.
@inproceedings{hsu-etal-2021-scicap-generating,
title = "{S}ci{C}ap: Generating Captions for Scientific Figures",
author = "Hsu, Ting-Yao and
Giles, C Lee and
Huang, Ting-Hao",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-emnlp.277",
pages = "3258--3264",
abstract = "Researchers use figures to communicate rich, complex information in scientific papers. The captions of these figures are critical to conveying effective messages. However, low-quality figure captions commonly occur in scientific articles and may decrease understanding. In this paper, we propose an end-to-end neural framework to automatically generate informative, high-quality captions for scientific figures. To this end, we introduce SCICAP, a large-scale figure-caption dataset based on computer science arXiv papers published between 2010 and 2020. After pre-processing {--} including figure-type classification, sub-figure identification, text normalization, and caption text selection {--} SCICAP contained more than two million figures extracted from over 290,000 papers. We then established baseline models that caption graph plots, the dominant (19.2{\%}) figure type. The experimental results showed both opportunities and steep challenges of generating captions for scientific figures.",
}
You can download the SciCap dataset here: Download Link (18.15 GB)
scicap_data.zip
├── SciCap-Caption-All #caption text for all figures
│ ├── Train
│ ├── Val
│ └── Test
├── SciCap-No-Subfig-Img #image files for the figures without subfigures
│ ├── Train
│ ├── Val
│ └── Test
├── SciCap-Yes-Subfig-Img #image files for the figures with subfigures
│ ├── Train
│ ├── Val
│ └── Test
├── arxiv-metadata-oai-snapshot.json #arXiv papers' metadata (from the arXiv dataset)
└── List-of-Files-for-Each-Experiments #list of figure names used in each experiment
├── Single-Sentence-Caption
│ ├── No-Subfig
│ │ ├── Train
│ │ ├── Val
│ │ └── Test
│ └── Yes-Subfig
│ ├── Train
│ ├── Val
│ └── Test
├── First-Sentence #Same as in Single-Sentence-Caption
└── Caption-No-More-Than-100-Tokens #Same as in Single-Sentence-Caption
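To give a sense of how these folders fit together, here is a minimal sketch that loads all caption records from the training split. The `scicap_data` extraction path and the assumption that each figure has its own JSON caption file are ours; adjust to your local layout.

```python
import json
from pathlib import Path

# Assumed path to the extracted scicap_data.zip; adjust as needed.
DATA_ROOT = Path("scicap_data")

# Each figure is assumed to have its own JSON caption file in the split folder.
train_dir = DATA_ROOT / "SciCap-Caption-All" / "Train"
records = [json.loads(p.read_text()) for p in sorted(train_dir.glob("*.json"))]

print(f"Loaded {len(records)} caption records from the training split")
```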
| Data Collection | Does the figure have subfigures? | Train | Validation | Test |
|---|---|---|---|---|
| First Sentence | Yes | 226,608 | 28,326 | 28,327 |
| First Sentence | No | 106,834 | 13,354 | 13,355 |
| Single-Sentence Caption | Yes | 123,698 | 15,469 | 15,531 |
| Single-Sentence Caption | No | 75,494 | 9,242 | 9,459 |
| Caption w/ <=100 Tokens | Yes | 216,392 | 27,072 | 27,036 |
| Caption w/ <=100 Tokens | No | 105,687 | 13,215 | 13,226 |
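The figure names belonging to each of these settings live under `List-of-Files-for-Each-Experiments`. Below is a hedged sketch of reading one such list; it assumes each split folder holds JSON files containing arrays of figure file names, so inspect the folders to confirm the exact format before relying on it.

```python
import json
from pathlib import Path

DATA_ROOT = Path("scicap_data")  # assumed extraction path

# Collect the figure names for the Single-Sentence-Caption / No-Subfig / Train setting.
split_dir = (DATA_ROOT / "List-of-Files-for-Each-Experiments"
             / "Single-Sentence-Caption" / "No-Subfig" / "Train")
figure_names = set()
for list_file in split_dir.glob("*.json"):
    figure_names.update(json.loads(list_file.read_text()))

print(f"{len(figure_names)} figures in this experimental split")
```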
An actual JSON object from SCICAP:
{
"contains-subfigure": true,
"Img-text": ["(b)", "s]", "[m", "fs", "et", "e", "of", "T", "im", "Attack", "duration", "[s]", "350", "300", "250", "200", "150", "100", "50", "0", "50", "100", "150", "200", "250", "300", "0", "(a)", "]", "[", "m", "fs", "et", "e", "of", "ta", "nc", "D", "is", "Attack", "duration", "[s]", "10000", "9000", "8000", "7000", "6000", "5000", "4000", "3000", "2000", "1000", "0", "50", "100", "150", "200", "250", "300", "0"],
"paper-ID": "1001.0025v1",
"figure-ID": "1001.0025v1-Figure2-1.png",
"figure-type": "Graph Plot",
"0-originally-extracted": "Figure 2: Impact of the replay attack, as a function of the spoofing attack duration. (a) Location offset or error: Distance between the attack-induced and the actual victim receiver position. (b) Time offset or error: Time difference between the attack-induced clock value and the actual time.",
"1-lowercase-and-token-and-remove-figure-index": {
"caption": "impact of the replay attack , as a function of the spoofing attack duration . ( a ) location offset or error : distance between the attack-induced and the actual victim receiver position . ( b ) time offset or error : time difference between the attack-induced clock value and the actual time .",
"sentence": ["impact of the replay attack , as a function of the spoofing attack duration .", "( a ) location offset or error : distance between the attack-induced and the actual victim receiver position .", "( b ) time offset or error : time difference between the attack-induced clock value and the actual time ."],
"token": ["impact", "of", "the", "replay", "attack", ",", "as", "a", "function", "of", "the", "spoofing", "attack", "duration", ".", "(", "a", ")", "location", "offset", "or", "error", ":", "distance", "between", "the", "attack-induced", "and", "the", "actual", "victim", "receiver", "position", ".", "(", "b", ")", "time", "offset", "or", "error", ":", "time", "difference", "between", "the", "attack-induced", "clock", "value", "and", "the", "actual", "time", "."]
},
"2-normalized": {
"2-1-basic-num": {
"caption": "impact of the replay attack , as a function of the spoofing attack duration . ( a ) location offset or error : distance between the attack-induced and the actual victim receiver position . ( b ) time offset or error : time difference between the attack-induced clock value and the actual time .",
"sentence": ["impact of the replay attack , as a function of the spoofing attack duration .", "( a ) location offset or error : distance between the attack-induced and the actual victim receiver position .", "( b ) time offset or error : time difference between the attack-induced clock value and the actual time ."],
"token": ["impact", "of", "the", "replay", "attack", ",", "as", "a", "function", "of", "the", "spoofing", "attack", "duration", ".", "(", "a", ")", "location", "offset", "or", "error", ":", "distance", "between", "the", "attack-induced", "and", "the", "actual", "victim", "receiver", "position", ".", "(", "b", ")", "time", "offset", "or", "error", ":", "time", "difference", "between", "the", "attack-induced", "clock", "value", "and", "the", "actual", "time", "."]
},
"2-2-advanced-euqation-bracket": {
"caption": "impact of the replay attack , as a function of the spoofing attack duration . BRACKET-TK location offset or error : distance between the attack-induced and the actual victim receiver position . BRACKET-TK time offset or error : time difference between the attack-induced clock value and the actual time .",
"sentence": ["impact of the replay attack , as a function of the spoofing attack duration .", "BRACKET-TK location offset or error : distance between the attack-induced and the actual victim receiver position .", "BRACKET-TK time offset or error : time difference between the attack-induced clock value and the actual time ."],
"tokens": ["impact", "of", "the", "replay", "attack", ",", "as", "a", "function", "of", "the", "spoofing", "attack", "duration", ".", "BRACKET-TK", "location", "offset", "or", "error", ":", "distance", "between", "the", "attack-induced", "and", "the", "actual", "victim", "receiver", "position", ".", "BRACKET-TK", "time", "offset", "or", "error", ":", "time", "difference", "between", "the", "attack-induced", "clock", "value", "and", "the", "actual", "time", "."]
}
}
}
Corresponding Figure: 1001.0025v1-Figure2-1.png
- contains-subfigure: boolean; true if the figure contains subfigures
- paper-ID: the unique paper ID in the arXiv dataset
- figure-ID: the ID of the figure extracted from the paper (this index does not necessarily match the figure number in the caption)
- figure-type: the figure type
- 0-originally-extracted: the original figure caption as extracted from the paper
- 1-lowercase-and-token-and-remove-figure-index: the caption lowercased and tokenized, with the figure index (e.g., "Figure 2:") removed
- 2-normalized:
  - 2-1-basic-num: the caption after replacing numbers with NUM-TK
  - 2-2-advanced-euqation-bracket: the caption after replacing equations and bracketed content with EQUAT-TK and BRACKET-TK
- Img-text: text extracted from the figure image, such as axis labels, legends, etc.
Within the caption content, we have three attributes:
- caption: the caption text after the corresponding normalization step
- sentence: a list of segmented sentences
- token: a list of tokenized words
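Putting these fields together, here is a minimal sketch that loads the record shown above and locates its image file. The caption file naming (one JSON per figure, named after the figure ID) is an assumption, and the split folder must match the one the figure came from.

```python
import json
from pathlib import Path

DATA_ROOT = Path("scicap_data")  # assumed extraction path

# Assumed naming: one caption JSON per figure, named after the figure ID.
record = json.loads(
    (DATA_ROOT / "SciCap-Caption-All" / "Train" / "1001.0025v1-Figure2-1.json").read_text()
)

# Pick the normalization variant you want to train on.
caption = record["2-normalized"]["2-1-basic-num"]["caption"]

# Figures with and without subfigures live in different image folders.
img_folder = "SciCap-Yes-Subfig-Img" if record["contains-subfigure"] else "SciCap-No-Subfig-Img"
image_path = DATA_ROOT / img_folder / "Train" / record["figure-ID"]

print(caption)
print(image_path)
```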
In the paper, we used [NUM], [BRACKET], and [EQUATION], but for the final data release we decided to use NUM-TK, BRACKET-TK, and EQUAT-TK to avoid the parsing problems caused by square brackets.
| Token | Description |
|---|---|
| NUM-TK | Numbers (e.g., 0, -0.2, 3.44%, 1,000,000). |
| BRACKET-TK | Text spans enclosed by any type of bracket pair, including {}, [], and (). |
| EQUAT-TK | Math equations identified using regular expressions. |
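For intuition, here is a rough, illustrative approximation of the NUM-TK substitution using a regular expression. The actual patterns used to build SciCap are not documented here, so treat this only as a sketch.

```python
import re

# Match standalone numbers: optional sign, digits with optional thousands
# separators, an optional decimal part, and an optional percent sign.
NUM_RE = re.compile(r"(?<![\w.\-])-?\d[\d,]*(?:\.\d+)?%?(?!\w)")

def normalize_numbers(text: str) -> str:
    """Replace number spans with the NUM-TK token (illustrative only)."""
    return NUM_RE.sub("NUM-TK", text)

print(normalize_numbers("accuracy rose from 85.2% to 91%, n = 1,000,000"))
# -> accuracy rose from NUM-TK to NUM-TK, n = NUM-TK
```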
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This dataset uses data from the arXiv dataset. The arXiv dataset releases its metadata under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication license, which grants permission to remix, remake, annotate, and publish the metadata.