Multimodal summarization dataset for Russian
The dataset currently contains 480 papers from 8 scientific domains: linguistics, history, law, medicine, journalism, computer science, economics, and chemistry.
Each paper in the dataset occupies one folder, which contains the following files:
- `name.txt`: name of the paper
- `abstract.txt`: its abstract
- `text.txt`: its full text
- `image_number.png`: figures
- `table_number.png`: tables
- `figures.json`: descriptions of figures
- `tables.json`: descriptions of tables
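
A minimal sketch of how one paper folder could be read in Python, based on the file layout listed above. The function name `load_paper` is hypothetical and not part of the dataset. The sketch assumes the text files are UTF-8 encoded and that figure and table images are named `image_<number>.png` and `table_<number>.png`. The internal structure of the JSON files is not documented here, so they are returned as parsed without further interpretation.

```python
import json
from pathlib import Path


def load_paper(folder):
    """Read one paper folder into a dictionary (hypothetical helper)."""
    folder = Path(folder)

    def read_text(filename):
        # Assumption: text files are UTF-8; a missing file yields None.
        path = folder / filename
        if not path.exists():
            return None
        return path.read_text(encoding="utf-8").strip()

    def read_json(filename):
        # The JSON schema is not documented, so return the parsed object as-is.
        path = folder / filename
        if not path.exists():
            return None
        with path.open(encoding="utf-8") as f:
            return json.load(f)

    return {
        "name": read_text("name.txt"),
        "abstract": read_text("abstract.txt"),
        "text": read_text("text.txt"),
        # Assumption: images follow the image_<number>.png / table_<number>.png pattern.
        "figure_files": sorted(p.name for p in folder.glob("image_*.png")),
        "table_files": sorted(p.name for p in folder.glob("table_*.png")),
        "figures": read_json("figures.json"),
        "tables": read_json("tables.json"),
    }
```

Calling `load_paper("path/to/paper_folder")` returns the title, abstract, and full text as strings, plus the image file names and the parsed description files. Papers without figures or tables simply get empty lists.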
Domain | Length (chars) | Length (tokens) | Figures | Tables |
---|---|---|---|---|
Economics | 1 316 995 | 151 284 | 32 | 25 |
Chemistry | 938 743 | 109 859 | 159 | 150 |
History | 1 540 251 | 184 407 | 2 | 17 |
IT | 1 002 115 | 114 721 | 238 | 27 |
Journalism | 1 377 087 | 174 064 | 45 | 12 |
Law | 1 243 153 | 143 675 | 0 | 2 |
Linguistics | 1 557 481 | 190 478 | 1 | 1 |
Medicine | 963 178 | 107 449 | 19 | 45 |
Total | 9 939 003 | 1 175 937 | 496 | 279 |
We tested the following LLMs: GigaChat, YandexGPT, and GPT-3.5 Turbo. The code for running these models is available here: link to Colab Notebook
Alena Tsanda and Elena Bruches. Russian-Language Multimodal Dataset for Automatic Summarization of Scientific Papers. arXiv:2405.07886