This codebase now includes several basic coherence models:
Code for:
- entity grid experiment
- entity graph experiment
Both are multilingual, currently supporting English, French, German and Spanish. For English, syntactic roles can be derived; this is not the case for French, German and Spanish, so for those languages the grid/graph records only entity occurrences, not their syntactic roles. For the graph in particular, the best option in that case is to run with the weighted projection.
Code for a syntax-based model:
- implementation of a local coherence model based on syntactic patterns (Louis and Nenkova, 2012)
- our own adaptation of it: a fully generative model based on IBM Model 1. Instead of assuming a uniform distribution, we learn a probability distribution over alignments to better capture the patterns (see the sketch below).
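To illustrate the idea only (this is not the toolkit's actual implementation), an IBM Model 1 style EM loop that learns alignment probabilities between the syntactic patterns of adjacent sentences could look like the following; the pattern representation and all names are assumptions of the sketch:

```python
# Toy sketch of the IBM Model 1 idea: treat the syntactic patterns of one
# sentence as the "source" and those of the next as the "target", and learn
# t(target_pattern | source_pattern) by EM instead of assuming uniformity.
# Illustrative only; not the toolkit's actual implementation.
from collections import defaultdict

def ibm1(pairs, iterations=10):
    """pairs: list of (source_patterns, target_patterns) for adjacent sentences."""
    t = defaultdict(lambda: 1.0)             # uniform start works for the E-step
    for _ in range(iterations):
        count = defaultdict(float)           # expected co-occurrence counts
        total = defaultdict(float)
        for src, tgt in pairs:
            for f in tgt:
                z = sum(t[(f, e)] for e in src)   # normaliser over alignments
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():      # M-step: renormalise
            t[(f, e)] = c / total[e]
    return dict(t)

pairs = [(['S->NP VP', 'NP->DT NN'], ['S->NP VP', 'VP->VBD NP'])]
print(ibm1(pairs))
```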
=====================================================
The goal is to test existing models on Machine Translation output, an entirely different and more challenging context than the one they are generally used in. Previously they have been used to assess coherence monolingually, in a clear-cut task (e.g. selecting the best order from the shuffled sentences of a coherent text), whereas lack of coherence can be caused in other, more subtle ways.
We investigate these models for the task of measuring the coherence of machine translation output. This is a very different scenario: firstly, it is more subtle, as the sudden breaks in transitions or shifts of focus apparent in the traditional monolingual test scenario are absent; and secondly, the MT output sentences alone may be incoherent, and may contain other textual issues such as ungrammatical fragments. Moreover, given that standard MT systems generate translations on a sentence-by-sentence basis, several coherence-related phenomena spanning sentence boundaries can be lost, leading to incoherent document translations (e.g. incorrect co-referencing, inadequate discourse markers, and lack of lexical cohesion).
===============================================
Entity grids are constructed by identifying the discourse entities in the documents under consideration and building a 2D grid, where each column corresponds to an entity (i.e. a noun) being tracked and each row represents a particular sentence in the document. Once all occurrences of nouns and the syntactic roles they take in each sentence (Subject (S), Object (O), or other (X)) are extracted, an entity transition is defined as a consecutive occurrence of an entity with given syntactic roles. Transitions are computed by reading the grid vertically for each entity.
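For illustration, a minimal sketch of a grid and its transition counts (entity detection and role labelling are stubbed out here; in this toolkit they come from the Stanford parser):

```python
# Minimal sketch of an entity grid and its transition counts. Entity detection
# and role labelling are stubbed out; in this toolkit they come from the
# Stanford parser. Roles: 'S' subject, 'O' object, 'X' other, '-' absent.
grid = {                      # one column per entity, one slot per sentence
    'department': ['S', '-', 'X'],
    'trial':      ['O', 'S', '-'],
    'microsoft':  ['S', 'O', 'S'],
}

def transition_counts(grid, k=2):
    """Read each entity's column top to bottom and count role k-grams."""
    counts = {}
    for roles in grid.values():
        for i in range(len(roles) - k + 1):
            t = tuple(roles[i:i + k])
            counts[t] = counts.get(t, 0) + 1
    return counts

counts = transition_counts(grid)
total = sum(counts.values())
print({t: c / total for t, c in counts.items()})  # relative frequencies
```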
Entity graphs project the entity grid into a bipartite graph format. They capture the same entity transition information as the entity grid, although they only track the occurrence of entities, avoiding the nulls of the grid, and can additionally track cross-sentential references. The graph tracks the presence of all entities, taking all nouns in the document as discourse entities, and connects them to the sentences they occur in. The coherence of a text in this model is measured by calculating the average outdegree of a projection, summing the shared edges (i.e. of entities leaving a sentence) between two sentences. There are 3 types of graph projections: binary, weighted and syntactic. Binary projections simply record whether sentences have any entities in common. Weighted projections take the number of shared entities into account, rating the projections higher the more entities are shared. A syntactic projection additionally weights the importance of a link by syntactic role, counting an entity in the role of subject (S) as 3, an entity in the role of object (O) as 2, and other (X) as 1.
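A simplified sketch of the three projections and the average-outdegree score follows; the data layout and the product weighting in the syntactic projection are assumptions of the sketch, not necessarily the exact implementation:

```python
# Simplified sketch of the graph projections and average-outdegree score.
# Role weights follow the text above: S=3, O=2, X=1.
ROLE_WEIGHT = {'S': 3, 'O': 2, 'X': 1}

# Each sentence maps the entities it mentions to their syntactic role.
sentences = [
    {'microsoft': 'S', 'trial': 'O'},
    {'trial': 'S', 'evidence': 'X'},
    {'microsoft': 'S', 'evidence': 'O'},
]

def avg_outdegree(sentences, projection='binary'):
    """Average outdegree of the one-mode (sentence-to-sentence) projection."""
    n = len(sentences)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):               # edges leave s_i towards s_j
            shared = set(sentences[i]) & set(sentences[j])
            if not shared:
                continue
            if projection == 'binary':
                total += 1                       # any shared entity counts once
            elif projection == 'weighted':
                total += len(shared)             # one unit per shared entity
            else:                                # 'syntactic'
                total += sum(ROLE_WEIGHT[sentences[i][e]] *
                             ROLE_WEIGHT[sentences[j][e]] for e in shared)
    return total / n

for p in ('binary', 'weighted', 'syntactic'):
    print(p, avg_outdegree(sentences, p))
```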
Syntax models: Louis and Nenkova (2012) create a coherence model based on syntactic patterns extracted from documents marked up with parse trees. The local model holds that in a coherent text, consecutive sentences will exhibit syntactic regularities; particular patterns may prove typical of specific discourse types and identify the 'intentional discourse structure'. We examine the syntactic structure of sentence pairs to establish any patterns. This is done by computing the most frequent syntactic productions that occur in adjacent sentences. We initially work with parse tree productions, investigating pairs of syntactic items.
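For illustration, a minimal sketch of counting production pairs across adjacent sentences, assuming PTB bracketed strings and using nltk; the toy trees and function names are invented for the example:

```python
# Sketch of pattern extraction: count pairs of syntactic productions drawn
# from adjacent sentences. Assumes PTB bracketed strings and uses nltk.
from collections import Counter
from itertools import product
from nltk import Tree

def productions(ptb_string):
    """Non-lexical productions of one parse tree, as strings."""
    tree = Tree.fromstring(ptb_string)
    return [str(p) for p in tree.productions() if p.is_nonlexical()]

def adjacent_pattern_counts(doc_trees):
    """doc_trees: PTB strings, one per sentence, in document order."""
    counts = Counter()
    for prev, curr in zip(doc_trees, doc_trees[1:]):
        for pair in product(productions(prev), productions(curr)):
            counts[pair] += 1
    return counts

doc = ["(S (NP (DT The) (NN trial)) (VP (VBD began)))",
       "(S (NP (PRP It)) (VP (VBD lasted) (NP (CD two) (NNS years))))"]
print(adjacent_pattern_counts(doc).most_common(3))
```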
=====================================================================================================================
Training data (for syntax model and entity grid) and test data:
This needs to have document breaks, either XML tags or, for plain text, some other marker. The WMT data often contains malformed XML, so it breaks at runtime (check for unescaped & characters etc. before running EntityExperiments with the xml flag; it might take a few runs — a pre-check like the sketch below can help). EntityExperiments can output grids in doctext format, i.e. separated with "# docid".
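A minimal, hypothetical pre-check for unescaped ampersands (the usual culprit) might look like:

```python
# Hypothetical pre-check for the malformed XML mentioned above: report lines
# containing a '&' that is not part of a valid XML entity. File name comes
# from the command line; the entity list is not exhaustive.
import re
import sys

UNESCAPED_AMP = re.compile(r'&(?!amp;|lt;|gt;|quot;|apos;|#\d+;|#x[0-9a-fA-F]+;)')

def find_bad_ampersands(path):
    with open(path, encoding='utf-8') as f:
        for lineno, line in enumerate(f, 1):
            if UNESCAPED_AMP.search(line):
                yield lineno, line.rstrip()

if __name__ == '__main__':
    for lineno, line in find_bad_ampersands(sys.argv[1]):
        print(f'{sys.argv[1]}:{lineno}: {line}')
```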
You can also use the Python scripts: see https://github.com/karins/CoherenceFramework/blob/master/python/discourse/README.md, e.g. to get doctext from multiple docs: cat list_of_files.txt | python -m discourse.doctext > corpus.doctext
To get PTB trees for training syntax models, you can use e.g.: python -m discotools analysis parsedocs --jobs 30 docs/newstest2012.cs-en.ref trees/newstest2012.cs-en.ref
Grids are derived in Java code using the Stanford Parser. They can be derived from an entire directory containing texts of concatenated documents, and can be output in one file concatenating all grids by using EntityExperiments as the entry point. Input can be in XML format (with tags) or plain text separated via '# docid' style breaks. Once grids have been constructed, the transition probabilities need to be derived. This was also done in Java (the original, discriminative version of the grid is in Java and derives the relative probabilities), but can now be run in Python, which is the faster option for the generative grid. The derived probabilities can be used to test the coherence of new documents (grid_decoder.py).
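In outline, the generative scoring step might look like the following sketch; this is a hypothetical illustration, not the actual grid_decoder.py interface:

```python
# Hypothetical sketch of the generative scoring step: given transition
# probabilities estimated from training grids, score a new document's grid
# by its length-normalised log-likelihood. Not the grid_decoder.py interface.
import math

def score_grid(grid, probs, k=2, floor=1e-6):
    """grid: {entity: [role per sentence]}; probs: {role k-gram: probability}."""
    logp, n = 0.0, 0
    for roles in grid.values():
        for i in range(len(roles) - k + 1):
            t = tuple(roles[i:i + k])
            logp += math.log(probs.get(t, floor))   # floor smooths unseen k-grams
            n += 1
    return logp / max(n, 1)

probs = {('S', 'O'): 0.2, ('O', '-'): 0.3, ('-', '-'): 0.4, ('-', 'S'): 0.1}
grid = {'trial': ['S', 'O', '-'], 'court': ['-', '-', 'S']}
print(score_grid(grid, probs))
```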
The entity graph does not require training: this metric can be computed directly. It can be run with various options, notably to determine the projection (one of syntactic, weighted, unweighted) and the language.
The syntax model uses PTB trees as input. These can be derived via Java (ParseTreeConverter) or Python code.
To summarize: to construct grids and run the graph code (which produces scores directly, no training), use EntityExperiments. Ensure correctly formatted data is in:
- data/docs (doctext format of files)
- data/trees (PTB tree representations derived from the above)
- data/grids (grid representations of these documents)
These all have to be present before running the pipeline script.
EntityGridExtractor: creates grids from PTB input files.
On Linux:
java -classpath DiscourseFramework-1.0-jar-with-dependencies.jar:. nlp/framework/discourse/EntityGridExtractor "/experiments/data/" "English"
EntityExperiments:
On Windows:
java -classpath DiscourseFramework-1.0.jar;. nlp/framework/discourse/EntityExperiments "C:\inputfilelocation\" "English" "false" "false" "2"
On Linux:
java -classpath DiscourseFramework-1.0.jar:. nlp/framework/discourse/EntityExperiments "/experiments/data/test" "English" "false" "false"
Required jars: stanford-corenlp-3.7.0.jar, stanford-corenlp-3.7.0-models.jar

Entity Grid:
java -classpath DiscourseFramework.jar nlp/framework/discourse/EntityGridFramework "C:\inputfilelocation\" "English" "false" "false"
NB: you can toggle the "ssplit.eolonly" property in EntityGridFramework (line 87) in order to work with parallel documents. It is not set to true in the build so that the unit tests sentence-split correctly.
=============================================================================================
If you use this code and find it helpful, please cite:
@inproceedings{simsmith2016cohere,
  author = {Sim Smith, Karin and Aziz, Wilker and Specia, Lucia},
  title = {{Cohere: A Toolkit for Local Coherence}},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year = {2016},
}