/ProCluster

Code for "Proposition-Level Clustering for Multi-Document Summarization" paper

Primary LanguagePythonApache License 2.0Apache-2.0

ProCluster

Code and Model for "Proposition-Level Clustering for Multi-Document Summarization" paper

--This repository is still under construction--

supervised_oie_wrapper directory is a wrapper over AllenNLP's (v0.9.0) pretrained Open IE model that was implemented by Gabriel Stanovsky. It was forked from here, and edited for our purpose.

You are welcome to try our demo. Look for the Multi-Doc Summary by Ernst et al skill.

How to generate summaries?

Preliminary steps

  1. Download the trained models from here, and put them in 'models' directory.

  2. Put your data in data\<DATASET>\ directory. (For example data\DUC2004\)

  3. Install requirements.txt (python 3.6)

  4. Create similarity matrix by SuperPAL (to be used for the clustering step):

    a. Clone SuperPAL repository.

    b. Move files from SuperPAL folder in this repository to the new SuperPAL repository.

    c. Follow the steps that appear in SuperPAL repository under 'Alignment model' section.

     Instead of step 2, run:

     python main_predict_inDoc.py -data_path <DATA_PATH>  -output_path <OUT_DIR_PATH>  -alignment_model_path  <ALIGNMENT_MODEL_PATH>
    

[Optional] 5. Follow this repository to install the official ROUGE measure.

Generating summaries

  1. Extract all Open Information Extraction (OIE) spans from the source documents:
  python extract_OIEs.py
  1. Prepare the data for the Salience model:
  python DataGenSalientIU_DUC_allAlignments_CDLM.py
  1. Predict salience score for each OIE span:
   cd transformers
   python run_glue_highlighter.py --model_name_or_path <MODEL_PATH>  --train_file <DATA_CSV_FILE_PATH> --validation_file <DATA_CSV_FILE_PATH>   --do_predict   --evaluation_strategy steps --eval_steps 250 --save_steps 250 --max_seq_length 4096 --gradient_accumulation_steps 3 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --learning_rate 1e-5 --num_train_epochs 3 --output_dir <OUTPUT_DIR>

[Optional] 3*. Cluster salient spans, rank clusters, and select the most salient span to represent each cluster: ("Salience_prop + Clustering" model in Sec 4.3 in the paper)

 python deriveSummaryDUC.py
  1. Cluster salient spans and prepare data for the Fusion model:
 python prepare_fusion_data.py --salience_pred_file <SALIENCE_MODEL_SCORES_FILE> --output_summ_dir <OUTPUT_SUMMARY_DIRECTORY_PATH> --data_path <RAW_DATA_PATH> --sim_mat_path <SIM_MAT_PATH>

  1. Generate a fused sentence from every cluster:
 cd <PATH_TO_YOUR_TRANSFORMERS_DIR>\examples\seq2seq\
 python finetune_trainer.py --model_name_or_path=<MODEL_PATH> --learning_rate=3e-5  --do_predict --num_train_epochs=4 --evaluation_strategy steps --predict_with_generate --eval_steps=50 --per_device_train_batch_size=10 --per_device_eval_batch_size=10 --max_source_length=265 --eval_beams=6 --max_target_length=30 --val_max_target_length=30 --test_max_target_length=30 --data_dir <DATA_CSV_FILE_PATH> --output_dir <OUTPUT_DIR>
  1. Concatinate the fused sentences, and calculate final ROUGE scores:
 python deriveSummaryDUC_fusion_clusters.py