summarization-dataset

Multimodal summarization dataset for Russian

Structure

The dataset currently contains 480 papers from 8 scientific domains: linguistics, history, law, medicine, journalism, computer science, economics, and chemistry.

Each paper in the dataset occupies one folder, which contains the following files:

  • name.txt - name of the paper
  • abstract.txt - its abstract
  • text.txt - its full text
  • image_number.png - figures
  • table_number.png - tables
  • figures.json - descriptions of figures
  • tables.json - descriptions of tables
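
As an illustration of the layout above, here is a minimal sketch of reading one paper folder in Python. Several details are assumptions rather than documented facts: that the text files are UTF-8 encoded, that `number` in the image file names stands for an index (so the patterns `image_*.png` and `table_*.png` match them), and that `figures.json` or `tables.json` may be missing when a paper has no figures or tables. The `Paper` class and `load_paper` function are hypothetical names, not part of the repository, and the JSON contents are returned as parsed because their schema is not described here.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, List, Optional


@dataclass
class Paper:
    """Hypothetical container for the files found in one paper folder."""
    name: str
    abstract: str
    text: str
    figure_paths: List[Path] = field(default_factory=list)
    table_paths: List[Path] = field(default_factory=list)
    figure_descriptions: Optional[Any] = None
    table_descriptions: Optional[Any] = None


def _read_text(path: Path) -> str:
    # Assumption: the .txt files are UTF-8 encoded.
    return path.read_text(encoding="utf-8").strip()


def _read_json(path: Path) -> Optional[Any]:
    # Assumption: the file may be absent; return None in that case.
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def load_paper(folder) -> Paper:
    """Read name, abstract, full text, image paths and descriptions."""
    folder = Path(folder)
    return Paper(
        name=_read_text(folder / "name.txt"),
        abstract=_read_text(folder / "abstract.txt"),
        text=_read_text(folder / "text.txt"),
        # Assumption: figure and table images follow these name patterns.
        figure_paths=sorted(folder.glob("image_*.png")),
        table_paths=sorted(folder.glob("table_*.png")),
        figure_descriptions=_read_json(folder / "figures.json"),
        table_descriptions=_read_json(folder / "tables.json"),
    )
```

Calling `load_paper` on a paper folder returns a `Paper` whose `text` and `abstract` fields can serve as a source and reference pair for summarization, while the image paths stay available for multimodal models.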

Statistics

Domain        Length (chars)   Length (tokens)   Figures   Tables
Economics          1 316 995           151 284        32       25
Chemistry            938 743           109 859       159      150
History            1 540 251           184 407         2       17
IT                 1 002 115           114 721       238       27
Journalism         1 377 087           174 064        45       12
Law                1 243 153           143 675         0        2
Linguistics        1 557 481           190 478         1        1
Medicine             963 178           107 449        19       45
Total              9 939 003         1 175 937       496      279
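
The character, figure, and table counts could be recomputed along the following lines. This is only a sketch under stated assumptions: that "Length (chars)" refers to the length of `text.txt`, that files are UTF-8 encoded, and that figure and table images match `image_*.png` and `table_*.png`. Token counts are left out because the tokenizer is not specified here. The function names `paper_stats` and `total_stats` are hypothetical.

```python
from pathlib import Path


def paper_stats(folder):
    """Count characters, figure images and table images for one paper folder."""
    folder = Path(folder)
    # Assumption: length is measured on text.txt, read as UTF-8.
    text = (folder / "text.txt").read_text(encoding="utf-8")
    return {
        "chars": len(text),
        # Assumption: image files follow these name patterns.
        "figures": len(list(folder.glob("image_*.png"))),
        "tables": len(list(folder.glob("table_*.png"))),
    }


def total_stats(folders):
    """Sum the per-paper counts over an iterable of paper folders."""
    totals = {"chars": 0, "figures": 0, "tables": 0}
    for folder in folders:
        for key, value in paper_stats(folder).items():
            totals[key] += value
    return totals
```

Passing the paper folders of a single domain to `total_stats` would give one row of the table, provided the assumptions above match how the original figures were produced.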

LLMs usage

We tested the following LLMs: GigaChat, YandexGPT, and GPT-3.5 Turbo. The code for running these models is available in the Colab notebook: link to Colab Notebook

Citation

Alena Tsanda and Elena Bruches. Russian-Language Multimodal Dataset for Automatic Summarization of Scientific Papers. arXiv:2405.07886