MOTS

MOTS (MOdular Tool for Summarization) is a summarization system written in Java. It is as modular as possible and is intended to provide an architecture for implementing and testing new summarization methods, and to ease comparison with already implemented methods, in a unified framework. This system is the first completely modular system for automatic summarization and already supports more than a hundred combinations of modules. Such a system is needed because, although several evaluation campaigns exist in the automatic summarization (AS) field, summarization algorithms are hard to compare due to the wide variety of pre- and post-processing steps they use.

Getting Started

  • Javadoc available.
  • An example corpus from TAC2009, with its associated human summaries, is provided in /src/main/resources.

Prerequisites

  • Maven - Dependency Manager
  • glpk-utils in order to use ILP (sudo apt-get install glpk-utils)
  • Python 3 or later in order to use WordEmbeddings; make sure pip3 is up to date (sudo pip3 install --upgrade pip)
  • gensim in order to use WordEmbeddings (pip3 install gensim --user)
  • jep in order to use WordEmbeddings (pip3 install jep --user)

Installing

  • Optionally, set $CORPUS_DATA to your DUC/TAC folder.
  • Install ROUGE:
    • Set $ROUGE_HOME to your ROUGE installation folder (or to ./ROUGE-1.5.5/RELEASE-1.5.5).
    • Install the XML::DOM module in order to use the ROUGE Perl script (sudo cpan install XML::DOM).
    • Run ./rouge_install.sh or:
      • Set $ROUGE_EVAL_HOME to $ROUGE_HOME/data.
      • Recreate the database:
       cd data/WordNet-2.0-Exceptions/
       rm WordNet-2.0.exc.db # only if it exists
       perl buildExeptionDB.pl . exc WordNet-2.0.exc.db
      
       cd ..
       rm WordNet-2.0.exc.db # only if it exists
       ln -s WordNet-2.0-Exceptions/WordNet-2.0.exc.db WordNet-2.0.exc.db
      
  • Run the install.sh script.

Usage

MOTS is a command-line tool that can be used as follows:

./MOTS mots.X.Y.Z.jar -c <config_file> -m <multicorpus_file> -v <OPTIONAL>

The MOTS script sets some environment variables needed to run WordEmbeddings. If you don't use WordEmbeddings, you can launch MOTS directly with:

java -jar mots.X.Y.Z.jar -c <config_file> -m <multicorpus_file> -v <OPTIONAL>

Example configuration and multicorpus files are provided in /conf, but they should be adapted to your setup.

Go Deeper

Each summarization process is defined in a configuration file and the test corpus is defined in a multicorpus configuration file.

Process configuration

Example LexRank_MMR configuration file:

<CONFIG>
	<TASK ID="1">
		<LANGUAGE>english</LANGUAGE>
		<OUTPUT_PATH>doc/output</OUTPUT_PATH>
		<MULTITHREADING>true</MULTITHREADING>		
		<PREPROCESS NAME="GenerateTextModel">
			<OPTION NAME="StanfordNLP">true</OPTION>
			<OPTION NAME="StopWordListFile">$CORPUS_DATA/stopwords/englishStopWords.txt</OPTION>
		</PREPROCESS>
		<PROCESS>
			<OPTION NAME="CorpusIdToSummarize">all</OPTION>
			<OPTION NAME="ReadStopWords">false</OPTION>
			<INDEX_BUILDER NAME="TF_IDF.TF_IDF">
			</INDEX_BUILDER>
			<CARACTERISTIC_BUILDER NAME="vector.TfIdfVectorSentence">
			</CARACTERISTIC_BUILDER>
			<SCORING_METHOD NAME="graphBased.LexRank">
				<OPTION NAME="DampingParameter">0.15</OPTION>
				<OPTION NAME="GraphThreshold">0.1</OPTION>
				<OPTION NAME="SimilarityMethod">JaccardSimilarity</OPTION>
			</SCORING_METHOD>
			<SUMMARIZE_METHOD NAME="MMR">
				<OPTION NAME="CharLimitBoolean">true</OPTION>
				<OPTION NAME="Size">200</OPTION>
				<OPTION NAME="SimilarityMethod">JaccardSimilarity</OPTION>
				<OPTION NAME="Lambda">0.6</OPTION>
			</SUMMARIZE_METHOD>
		</PROCESS>
		<ROUGE_EVALUATION>
			<ROUGE_MEASURE>ROUGE-1	ROUGE-2	ROUGE-SU4</ROUGE_MEASURE>
			<MODEL_ROOT>models</MODEL_ROOT>
			<PEER_ROOT>systems</PEER_ROOT>
		</ROUGE_EVALUATION>
	</TASK>
</CONFIG>

  • <CONFIG> is the root node.
    • <TASK> represents a summarization task. You can run several tasks in a single run; to begin with, stick with one.
      • <LANGUAGE> is the language of the input documents, used for preprocessing (english or french for now).
      • <OUTPUT_PATH> is the system's output folder. It is used to save preprocessed documents, generated ROUGE XML files, previous scores, ...
      • <MULTITHREADING> (boolean) states whether the system runs multithreaded.
      • <PREPROCESS> is the preprocessing step. The Java preprocessing class to use is given by the NAME attribute; here it is GenerateTextModel. It also needs two <OPTION> nodes:
        • <OPTION NAME="StanfordNLP"> (boolean), true if you want to use the StanfordNLP pipeline and tools for preprocessing.
        • <OPTION NAME="StopWordListFile"> (String), path of the stopword list you want to use.
      • <PROCESS> is the main step of the system. It should have at least one <SUMMARIZE_METHOD> node and two <OPTION> nodes. It often also has an <INDEX_BUILDER> node and a <CARACTERISTIC_BUILDER> node:
        • <OPTION NAME="CorpusIdToSummarize"> (String, a list of int separated by \t), the list of corpus IDs to summarize from the multicorpus configuration file. "all" summarizes every corpus.
        • <OPTION NAME="ReadStopWords"> (boolean), states whether the system counts stopwords as part of the texts.
        • <INDEX_BUILDER> is the step where the system generates a computer-friendly representation of each textual unit of the texts (TF-IDF, Bigram, WordEmbeddings, ...).
        • <CARACTERISTIC_BUILDER> is the step that generates sentence characteristics from the textual unit index.
        • <SCORING_METHOD> weights each sentence.
        • <SUMMARIZE_METHOD> generates a summary, usually by ranking sentences according to their score.
      • <ROUGE_EVALUATION> is the ROUGE evaluation step. For details, see the ROUGE readme in the /lib/ROUGE folder.
        • <ROUGE_MEASURE> (String, a list of measure names separated by \t) is the list of ROUGE measures you want to use.
        • <MODEL_ROOT> is the name of the models folder for the ROUGE XML input files.
        • <PEER_ROOT> is the name of the peers folder for the ROUGE XML input files.
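
To check which modules a configuration file wires together, the file can be read with the JDK's standard DOM parser. The sketch below is only an illustration and is not MOTS's own configuration loader: the class ConfigOutline and the method processModules are hypothetical names, and the code assumes the layout shown in the example above (module nodes are direct children of the first <PROCESS> node and carry a NAME attribute).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of MOTS: lists the modules declared in <PROCESS>.
public class ConfigOutline {

    /** Returns "TAG=NAME" for each non-OPTION element child of the first PROCESS node. */
    static List<String> processModules(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse configuration", e);
        }
        List<String> modules = new ArrayList<>();
        // Exact tag-name match, so <PREPROCESS> is not picked up here.
        NodeList processes = doc.getElementsByTagName("PROCESS");
        if (processes.getLength() == 0) {
            return modules;
        }
        NodeList children = processes.item(0).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace text nodes
            }
            Element element = (Element) child;
            if (!element.getTagName().equals("OPTION")) {
                modules.add(element.getTagName() + "=" + element.getAttribute("NAME"));
            }
        }
        return modules;
    }

    public static void main(String[] args) {
        // Abbreviated version of the LexRank_MMR example configuration.
        String xml = "<CONFIG><TASK ID='1'><PROCESS>"
                + "<OPTION NAME='CorpusIdToSummarize'>all</OPTION>"
                + "<INDEX_BUILDER NAME='TF_IDF.TF_IDF'/>"
                + "<CARACTERISTIC_BUILDER NAME='vector.TfIdfVectorSentence'/>"
                + "<SCORING_METHOD NAME='graphBased.LexRank'/>"
                + "<SUMMARIZE_METHOD NAME='MMR'/>"
                + "</PROCESS></TASK></CONFIG>";
        processModules(xml).forEach(System.out::println);
    }
}
```

The method takes the XML as a string so that it can be tried without touching the file system; reading a real file first (for example with java.nio.file.Files.readString) would work the same way.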

The <PROCESS> node is the core of the system. For more detail, see the javadoc and the source code of the different INDEX_BUILDER, CARACTERISTIC_BUILDER, SCORING_METHOD and SUMMARIZE_METHOD classes.
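
To make the scoring and summarization steps more concrete, here is a minimal sketch of the textbook MMR (Maximal Marginal Relevance) selection rule with a Jaccard similarity and a character limit, mirroring the Lambda, SimilarityMethod, CharLimitBoolean and Size options of the example. It is an assumption about how such a step can work, not the MOTS implementation: the class MmrSketch, its method names, the whitespace tokenization and the exact trade-off formula are all illustrative, and the sentence scores are taken as given (in the example configuration they would come from the scoring method).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; MOTS's own MMR class may differ.
public class MmrSketch {

    /** Jaccard similarity between the lower-cased word sets of two sentences. */
    static double jaccard(String a, String b) {
        Set<String> wordsA = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> wordsB = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> intersection = new HashSet<>(wordsA);
        intersection.retainAll(wordsB);
        Set<String> union = new HashSet<>(wordsA);
        union.addAll(wordsB);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    /**
     * Greedily picks sentences maximizing
     * lambda * score - (1 - lambda) * (max similarity to already selected sentences),
     * skipping any sentence that would exceed the character limit.
     */
    static List<String> select(List<String> sentences, double[] scores,
                               double lambda, int charLimit) {
        List<String> summary = new ArrayList<>();
        boolean[] used = new boolean[sentences.size()];
        int length = 0;
        while (true) {
            int best = -1;
            double bestValue = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < sentences.size(); i++) {
                if (used[i] || length + sentences.get(i).length() > charLimit) {
                    continue;
                }
                double redundancy = 0.0;
                for (String chosen : summary) {
                    redundancy = Math.max(redundancy, jaccard(sentences.get(i), chosen));
                }
                double value = lambda * scores[i] - (1 - lambda) * redundancy;
                if (value > bestValue) {
                    bestValue = value;
                    best = i;
                }
            }
            if (best < 0) {
                return summary; // nothing left that fits
            }
            used[best] = true;
            summary.add(sentences.get(best));
            length += sentences.get(best).length();
        }
    }

    public static void main(String[] args) {
        List<String> sentences = List.of(
                "the cat sat on the mat",
                "the cat sat on the mat today",
                "stocks fell sharply");
        double[] scores = {0.9, 0.8, 0.5}; // made-up scores for illustration
        select(sentences, scores, 0.6, 200).forEach(System.out::println);
    }
}
```

With these made-up scores, the near-duplicate second sentence is pushed behind the unrelated third one, which is the behavior the Lambda trade-off is meant to produce: higher values favor the raw score, lower values favor diversity.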

Multicorpus configuration

<?xml version="1.0" encoding="UTF-8"?>
<CONFIG>
	<TASK ID="1">
		<MULTICORPUS ID="0">
			<CORPUS ID="0">
				<INPUT_PATH>$CORPUS_DATA/TAC2009/UpdateSumm09_test_docs_files/D0901A/D0901A-A</INPUT_PATH>
				<DOCUMENT ID="0">.*</DOCUMENT>
				<SUMMARY_PATH>$CORPUS_DATA/TAC2009/UpdateSumm09_eval/ROUGE/models</SUMMARY_PATH>
				<SUMMARY ID="0">D0901-A.*</SUMMARY>
			</CORPUS>
		</MULTICORPUS>
	</TASK>
</CONFIG>

For now, the ID attributes are not used and can be omitted.

  • <CONFIG> is the root node.
    • <TASK> represents a summarization task. You can run several tasks in a single run; to begin with, stick with one.
      • <MULTICORPUS> is a list of <CORPUS> nodes.
        • <CORPUS> can be one or more documents. The system generates one summary per corpus.
          • <INPUT_PATH> is the folder containing the documents of the corpus.
          • <DOCUMENT> is the regex for the documents you want to load. You can use multiple <DOCUMENT> nodes.
          • <SUMMARY_PATH> is the path of the folder containing the human summaries.
          • <SUMMARY> is the regex for the human summary files associated with this corpus. You can use multiple <SUMMARY> nodes.
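
The way <DOCUMENT> and <SUMMARY> regexes pick files out of a folder can be sketched as follows. This is a hedged illustration, not MOTS code: the class CorpusFileMatcher and its methods are hypothetical, the file names in main are invented, and two behaviors are assumptions, namely that a regex must match the whole file name and that $CORPUS_DATA is expanded by plain text substitution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of MOTS.
public class CorpusFileMatcher {

    /** Replaces the literal text $CORPUS_DATA with its value from the given environment. */
    static String expand(String path, Map<String, String> env) {
        String value = env.get("CORPUS_DATA");
        return value == null ? path : path.replace("$CORPUS_DATA", value);
    }

    /** Keeps the file names fully matched by at least one of the regexes. */
    static List<String> matching(List<String> fileNames, List<String> regexes) {
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
        List<String> kept = new ArrayList<>();
        for (String name : fileNames) {
            for (Pattern pattern : patterns) {
                if (pattern.matcher(name).matches()) { // assumption: whole-name match
                    kept.add(name);
                    break;
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Invented file names standing in for the content of <SUMMARY_PATH>.
        List<String> files = List.of("D0901-A.M.100.A.A", "D0901-A.M.100.A.B", "D0902-A.M.100.A.A");
        System.out.println(expand("$CORPUS_DATA/TAC2009", System.getenv()));
        System.out.println(matching(files, List.of("D0901-A.*")));
    }
}
```

Several regexes can be passed at once, which corresponds to using multiple <DOCUMENT> or <SUMMARY> nodes in one <CORPUS>.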

Built With

  • Maven - Dependency Management

Authors

License

This project is licensed under the GPL-3.0 license. See the LICENSE.md file for details.

Acknowledgments

  • Thanks to Aurélien Bossard, my PhD supervisor.