summarization-dataset

A multimodal dataset for automatic summarization of Russian-language scientific papers.

Structure

The dataset currently contains 480 papers from 8 scientific domains: linguistics, history, law, medicine, journalism, computer science (listed as IT in the statistics table), economics, and chemistry.

Each paper in the dataset occupies one folder, which contains the following files:

name.txt - name of the paper
abstract.txt - its abstract
text.txt - its full text
image_number.png - figures
table_number.png - tables
figures.json - descriptions of figures
tables.json - descriptions of tables
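
The folder layout above can be read with a few lines of Python. This is a minimal sketch, not code from the repository: the function name load_paper is hypothetical, UTF-8 encoding is assumed, "number" in the image and table file names is assumed to stand for an index (for example image_1.png), and the JSON files are returned as parsed because their schema is not described here.

```python
import json
from pathlib import Path


def load_paper(folder):
    """Read one paper folder into a dict (hypothetical helper)."""
    folder = Path(folder)

    def read_text(name):
        # Assumption: the text files are UTF-8 encoded.
        return (folder / name).read_text(encoding="utf-8").strip()

    def read_json(name):
        path = folder / name
        # Assumption: a paper without figures or tables may lack the file.
        if not path.exists():
            return None
        return json.loads(path.read_text(encoding="utf-8"))

    return {
        "name": read_text("name.txt"),
        "abstract": read_text("abstract.txt"),
        "text": read_text("text.txt"),
        # File names are sorted as strings, so image_10 sorts before image_2.
        "figure_files": sorted(p.name for p in folder.glob("image_*.png")),
        "table_files": sorted(p.name for p in folder.glob("table_*.png")),
        # The JSON schema is not documented here, so the data is returned unchanged.
        "figures": read_json("figures.json"),
        "tables": read_json("tables.json"),
    }
```

Calling load_paper on a single paper folder returns the title, abstract, and full text as strings, plus the names of the figure and table images found next to them.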

Statistics

Domain	Length (chars)	Length (tokens)	Figures	Tables
Economics	1 316 995	151 284	32	25
Chemistry	938 743	109 859	159	150
History	1 540 251	184 407	2	17
IT	1 002 115	114 721	238	27
Journalism	1 377 087	174 064	45	12
Law	1 243 153	143 675	0	2
Linguistics	1 557 481	190 478	1	1
Medicine	963 178	107 449	19	45
Total	9 939 003	1 175 937	496	279
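
The Total row can be checked against the per-domain rows with a short script. This is a sketch that only re-enters the numbers from the table above; the helper parse_count is a hypothetical name, and the characters-per-token ratio is derived purely from the two totals.

```python
def parse_count(cell):
    """Convert a space-grouped number such as '1 316 995' to an int."""
    return int(cell.replace(" ", ""))


# Columns: length in chars, length in tokens, figures, tables.
TABLE = {
    "Economics": ("1 316 995", "151 284", "32", "25"),
    "Chemistry": ("938 743", "109 859", "159", "150"),
    "History": ("1 540 251", "184 407", "2", "17"),
    "IT": ("1 002 115", "114 721", "238", "27"),
    "Journalism": ("1 377 087", "174 064", "45", "12"),
    "Law": ("1 243 153", "143 675", "0", "2"),
    "Linguistics": ("1 557 481", "190 478", "1", "1"),
    "Medicine": ("963 178", "107 449", "19", "45"),
}
TOTAL = ("9 939 003", "1 175 937", "496", "279")

# Sum each column over the eight domains and compare with the Total row.
column_sums = [
    sum(parse_count(row[i]) for row in TABLE.values()) for i in range(4)
]
totals = [parse_count(cell) for cell in TOTAL]
print("Totals match:", column_sums == totals)

# Average number of characters per token across the whole dataset.
chars_per_token = totals[0] / totals[1]
print(f"Characters per token: {chars_per_token:.2f}")
```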

LLMs usage

We tested the following LLMs: GigaChat, YandexGPT, and GPT-3.5 Turbo. The code for running these models is available in a Colab notebook: link to Colab Notebook

Citation

Alena Tsanda and Elena Bruches. Russian-Language Multimodal Dataset for Automatic Summarization of Scientific Papers. arXiv:2405.07886
