Two New Datasets for Italian-Language Abstractive Text Summarization

Summarization with T5 and MBart

This repo contains the code for the experiments on training T5 and MBart models for abstractive summarization of Italian-language text.

Models & Results

Dataset   Model      ROUGE-1
MLSum     IT5-Base
          MBart      19.35
Fanpage   IT5-Base
          MBart      36.50
IlPost    IT5-Base
          MBart      39.91

Datasets and models

Datasets and models are available on the Hugging Face Hub under the ARTeLab organization.

Eval Pegasus with translations

We used google/pegasus-cnn_dailymail and google/pegasus-xsum as existing baselines for comparison: the input is translated to English with Helsinki-NLP/opus-mt-it-en, summarized, and the output is translated back to Italian with Helsinki-NLP/opus-mt-en-it.

CUDA_VISIBLE_DEVICES="2" nohup python src/metrics_huggingface_eng_model.py \
        --model google/pegasus-cnn_dailymail \
        --path "./Data/IlPost/test.csv" \
        > logs/pegasus-cnn_dailymail_ilpost.log  2>&1 &
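
The translate, summarize, translate-back chain described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/metrics_huggingface_eng_model.py: the helper names (make_seq2seq, summarize_via_english, build_pegasus_pipeline) and the generation lengths are assumptions made for the example; only the model identifiers come from the text above.

```python
from typing import Callable

TextFn = Callable[[str], str]


def make_seq2seq(model_name: str, max_new_tokens: int = 64) -> TextFn:
    """Load a Hugging Face seq2seq checkpoint and wrap it as a str -> str function."""
    # Imported lazily so the composition logic below works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    def run(text: str) -> str:
        # Truncate to the model's maximum input length; long articles lose their tail.
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]

    return run


def summarize_via_english(article_it: str, it_to_en: TextFn,
                          summarize_en: TextFn, en_to_it: TextFn) -> str:
    """Italian article -> English article -> English summary -> Italian summary."""
    article_en = it_to_en(article_it)
    summary_en = summarize_en(article_en)
    return en_to_it(summary_en)


def build_pegasus_pipeline(summarizer: str = "google/pegasus-xsum") -> TextFn:
    """Wire the checkpoints named in this section into one function (assumed lengths)."""
    it_to_en = make_seq2seq("Helsinki-NLP/opus-mt-it-en", max_new_tokens=512)
    summarize_en = make_seq2seq(summarizer)
    en_to_it = make_seq2seq("Helsinki-NLP/opus-mt-en-it", max_new_tokens=128)
    return lambda text: summarize_via_english(text, it_to_en, summarize_en, en_to_it)
```

Calling build_pegasus_pipeline() downloads the three checkpoints on first use, and the tokenizers need the sentencepiece package. Because summarize_via_english only takes plain functions, any of the three steps can be swapped for another model or a stub.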

Eval our models (or any Italian summarization model from Hugging Face)

This script runs a trained model on a custom file.csv, which allows simple comparisons between models.

python src/metrics_huggingface_it_model.py \
        --path ./Data/MLSum/test.csv \
        --batch-size 5 \
        --model ARTeLab/it5-summarization-fanpage
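
As a rough illustration of what a ROUGE-1 score like the ones reported in the results measures, here is a simplified pure-Python version. It is a sketch under stated assumptions (lowercasing, whitespace tokenization, no stemming) and not the scorer used by the evaluation script, which may tokenize and aggregate differently; the function name rouge1 and the example sentences are made up for this example.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Unigram-overlap precision, recall and F1 between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative sentences only: all 3 candidate words appear in the 5-word reference.
scores = rouge1("il gatto dorme sul divano", "il gatto dorme")
print(scores)
```

Here precision is 3/3 and recall is 3/5. The values returned are in the range 0 to 1; the figures in the results appear to be on a 0 to 100 scale.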


  • Install requirements
pip install -r requirements-torch.txt
# we use an NVIDIA RTX 5000 with 16 GB of GPU memory

CUDA_VISIBLE_DEVICES="0,1,2" nohup python src/run_summarization.py \
        --output_dir /home/super/Models/summarization_mlsum2 \
        --model_name_or_path gsarti/it5-base \
        --tokenizer_name gsarti/it5-base \
        --train_file "./Data/MLSum/train.csv" \
        --validation_file "./MLSum/MLSum/val.csv" \
        --test_file "./Data/MLSum/test.csv" \
        --do_train --do_eval --do_predict \
        --logging_dir tensorboard/mlsum2 \
        --source_prefix "summarize: " \
        --predict_with_generate \
        --num_train_epochs 4 \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 2 \
        --overwrite_output_dir \
        --save_steps 500 \
        --save_total_limit 3 \
        --save_strategy="steps" \
        --max_source_length 512 --max_target_length 64 \
        > logs/mlsum2.log  2>&1 &
# we use an NVIDIA RTX 5000 with 16 GB of GPU memory
PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:25" CUDA_VISIBLE_DEVICES=0 nohup python src/run_summarization_mbart.py args.json > logs/mbart-fanpage2.log 2>&1 &
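
The contents of args.json are not shown here. Assuming src/run_summarization_mbart.py follows the Hugging Face example-script convention of reading its arguments from a single JSON file, a file along these lines could be generated with the sketch below. Every value is hypothetical: the keys mirror the flags of the T5 command above, while the base checkpoint, the Fanpage paths, and the output directory are assumptions, not the repo's actual settings.

```python
import json

# Hypothetical values: they mirror the T5 command line, not the repo's real args.json.
mbart_args = {
    "model_name_or_path": "facebook/mbart-large-cc25",  # assumed base checkpoint
    "output_dir": "./Models/mbart-fanpage",             # assumed output location
    "train_file": "./Data/Fanpage/train.csv",           # assumed dataset paths
    "validation_file": "./Data/Fanpage/val.csv",
    "test_file": "./Data/Fanpage/test.csv",
    "do_train": True,
    "do_eval": True,
    "do_predict": True,
    "predict_with_generate": True,
    "num_train_epochs": 4,
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
    "max_source_length": 512,
    "max_target_length": 64,
    "overwrite_output_dir": True,
}


def render_args(args: dict) -> str:
    """Serialize the argument dictionary as pretty-printed JSON text."""
    return json.dumps(args, indent=2)


print(render_args(mbart_args))
```

Saving the printed text as args.json gives a file of the shape the command above expects under that assumption; check the script's own argument definitions before relying on any of these keys.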


More details and results are available in the published work.

