Summary Project

This repo is used by our summarization team for the LING 575 course at UW Seattle.

Setup

To set up the project environment, follow the steps below:

  1. Navigate to the "setup" folder using the command line:

    cd setup
  2. Change the permission of the create_env.sh script to make it executable:

    chmod +x create_env.sh
  3. Run the create_env.sh script to create the conda environment:

    ./create_env.sh
  4. Activate the newly created environment:

    conda activate Summary
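
To confirm the setup worked, you can check that the new environment is registered with conda (assuming create_env.sh names the environment Summary, as step 4 implies):

    # the Summary environment should appear in this list;
    # once activated, an asterisk marks it
    conda env list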

Components

  • Preprocessing

    • File: src/doc_processor/doc_processing.py
    • This file contains functions for preprocessing the raw data, including cleaning, formatting, and transforming it.
  • Content Selection

    • File: src/model/content_selector.py
    • This file contains methods for selecting salient sentences for extractive summarization.
  • Information Ordering

    • File: src/model/information_orderer.py
    • This file contains methods for coherently reordering selected content for inclusion in a summary.
  • Content Realization

    • File: src/model/content_realizer.py
    • This file contains methods for realizing ordered content as a summary.
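
Together these components form a single pipeline: preprocessed documents flow into content selection, the selected sentences are reordered, and the ordered content is realized as the final summary. The sketch below illustrates that flow with toy stand-ins; the function names and bodies are illustrative assumptions, not the actual interfaces in src/:

    # Toy sketch of the pipeline flow. The real logic lives in the files
    # listed above; these names and bodies are assumptions for illustration.
    def preprocess(raw_docs):                        # doc_processing.py
        return [doc.strip() for doc in raw_docs]

    def select_content(docs, num_sentences=2):       # content_selector.py
        sentences = [s for doc in docs for s in doc.split(". ") if s]
        return sentences[:num_sentences]             # stand-in for tf-idf/textrank scoring

    def order_information(sentences):                # information_orderer.py
        return sorted(sentences, key=len)            # stand-in for TSP/entity-grid ordering

    def realize_content(sentences, max_length=100):  # content_realizer.py
        return " ".join(sentences)[:max_length]

    docs = ["First document. It has two sentences.", "A second, shorter document."]
    print(realize_content(order_information(select_content(preprocess(docs)))))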

Scripts

  • scripts/run_main.sh: This script runs the system per the parameters found in config.json. Usage:

    cd scripts
    ./run_main.sh
  • scripts/proc_docset.sh: This script finds, collates, and tokenizes data stored in the corpora/ directory on patas. Note that data/ is not tracked by git given the size of the files. Usage:

      cd scripts
      ./proc_docset.sh "/mnt/dropbox/23-24/575x/Data/Documents/training/2009/UpdateSumm09_test_topics.xml"  "../data/training"  
      ./proc_docset.sh "/mnt/dropbox/23-24/575x/Data/Documents/devtest/GuidedSumm10_test_topics.xml"  "../data/devtest"
      ./proc_docset.sh "/mnt/dropbox/23-24/575x/Data/Documents/evaltest/GuidedSumm11_test_topics.xml"  "../data/evaltest"

Configs

All arguments for the system are passed through the config file (config.json):

  • Primary Task

    • "document_processing": identifies arguments associated with ingesting and processing the original ACQUAINT and ACQUAINT-2 files.

      • "data_ingested": if any of the values are set to true, the system will load cached data from data/
      • "input_xml_file": identifies the location on patas for the XML files which identify the training, devtest, and evaltest documents.
      • "output_dir": identifies the directories in which to write out preprocessed files from the corpora
    • "model": arguments associated with the core summarization features of the system.

      • "content_selection": arguments associated with the content selection component.
        • "approach": identifies the content selection approach to use ("tf-idf", "textrank", "topic_focused", or "baseline").
          • "num_sentences_per_doc": max number of sentences to select from each document
          • "similarity threshold": (textrank only) minimum similarity score for inclusion in graph
          • "model_id": (textrank, topic_focused) for textrank and topic_focused, the transformers and sentence-transformers model identifier, respectively. Recommended models:
            • textrank: bert-base-cased, distilbert-base-cased, etc.
            • topic_focused: paraphrase-distilroberta-base-v1
      • "information_ordering": argument associated with the information ordering component.
        • "approach": identifies the information ordering approach to use ("random", "TSP", "entity_grid", "baseline").
          • "training_data_path": ("entity_grid") path to the directory containing training data
          • "all_possible_permutations_threshold": ("entity_grid") if selected sentences are under this threshold, calculate all permutations
          • "max_permutations": ("entity_grid") if not under threshold, the maximum number of permutations to search
      • "content_realization": argument associated with the content realization component.
        • "approach": identifies the information ordering approach to use ("simple", "advanced", "generative", "baseline").
          • "max_length": maximum number of word tokens allowed in output
          • "min_length": ("advanced") minimum word tokens required in output
          • "model_id": ("advanced", "generative") transformers or openai identifier for the desired model. Recommended models:
            • advanced: t5-small, t5-base, etc.
            • generative: gpt-3.5-turbo
          • "compression_method": ("advanced") defines the method associated with sentence compression must be "neural"
          • "do_sample": ("advanced") enables decoding strategies for next token selection
          • "temperature": ("generative") a number between 0 and 1 reflecting the sampling temperature for generative approach
          • "n": ("generative") number of desired responses from the OpenAI api.
    • "evaluation": identifies the evaluation metrics and associated output paths for results.

    • "output_dir": identifies the directory where the system summaries are written.

License

Distributed under the Apache License 2.0. See LICENSE for more information.

Authors

  • Ben Cote
    • Email: bpc23 at uw.edu
  • Mohamed Elkamhawy
    • Email: mohame at uw.edu
  • Karl Haraldsson
    • Email: kharalds at uw.edu
  • Alyssa Vecht
    • Email: avecht at uw.edu
  • Josh Warzecha
    • Email: jmwar73 at uw.edu