Parex Paraphrase Extractor -------------------------- If you use Parex in your work, please cite the following: Michael Denkowski and Alon Lavie, "METEOR-NEXT and the METEOR Paraphrase Tables: Improved Evaluation Support For Five Target Languages", Proceedings of the ACL 2010 Joint Workshop on Statistical Machine Translation and Metrics MATR, 2010 and Colin Bannard and Chris Callison-Burch, "Paraphrasing with Bilingual Parallel Corpora", Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, 2005 1. About: --------- Parex is a simple tool for extracting paraphrases from bilingual phrase tables using the Bannard and Callison-Burch [ACL-2005] method. In addition, language- independent paraphrase merging and filtering can be applied using information from the parallel corpora from which phrase tables are built. Parex assumes a server with at least 12GB of memory several GB of disk space. 2. Requirements for paraphrase extraction: ------------------------------------------ pt.gz - a gzipped Moses format bilingual phrase table fCorpus - the foreign corpus used to build pt.gz, pre-processed exactly as given to word alignment/phrase extraction nCorpus - the native corpus used to build pt.gz fTgtCorpus - a small foreign target corpus (10,000 sentences recommended) to be paraphrased. Paraphrases will be extracted for all possible phrases in this corpus. This can be `head -n 10000` of fCorpus, although material in the same domain as test data for your NLP task(s) can be more useful. nTgtCorpus - a small native target corpus 3. Extracting paraphrases: -------------------------- Given the above files, extract paraphrases: $ java -XX:+UseCompressedOops -Xmx12G -jar parex-*.jar <fCorpus> <nCorpus> \ <pt.gz> <fTgtCorpus> <nTgtCorpus> <outPrefix> [minTP] [minRF] [symbols] outPrefix - prefix for output files The last three parameters are optional: minTP - minimum translation probability for phrase pairs. Any phrase pairs with lower probability will not be considered during paraphrase extraction. (default 0.001) minRF - minimum relative frequency in corpus to be considered a common word. Any phrases with one or both sides consisting of only common words will not be considered during paraphrase extraction. (default 0.001) symbols - string of punctuation symbols. Phrase pairs containing punctuation symbols will not be considered during paraphrase extraction. (default "~`!@#$%^&*()-_=+[{]}\\|;:'\",<.>/?") The following files are produced: <pre>.n.common - native common word list <pre>.n.raw.gz - native, unsorted paraphrase instances <pre>.n.grp.gz - native, sorted paraphrase instances <pre>.n.par.gz - merged paraphrase table. This is the final paraphrase table if only one phrase table is being used Foreign files <pre>.f.* are foreign equivalents of above. Paraphrase tables contain lines in the following format: phrase1 ||| phrase2 ||| prob This indicates that phrase1 (reference phrase) can be paraphrased by phrase2 (paraphrase) with probability prob. Thus prob is the probability of the paraphrase given the reference. For example: day before ||| yesterday ||| 0.0175624491042 This indicates that "day before" can be paraphrased as "yesterday" with P(yesterday|day before) = 0.0175624491042. 4. Merging paraphrase tables: ----------------------------- To merge paraphrase tables built from multiple phrase tables: $ java -XX:+UseCompressedOops -Xmx12G -cp parex-*.jar MergeParaphraseTables \ <outPrefix> <par1.gz> <wc1> <par2.gz> <wc2> [par3.gz wc3 ...] par1.gz - paraphrase table build as in previous section wc1 - number of sentence pairs in copora used to create phrase table used to produce par1.gz par2.gz, wc2, ... - other tables to be merged, same as above Output: <pre>.mrg.par.gz - merged paraphrase table. Paraphrase probabilities are weighted means over phrase tables/corpora in which they appear. Format is same as original paraphrase table. 5. Filtering paraphrase tables: ------------------------------- To filter extracted or merged paraphrase tables: $ java -cp parex-*.jar Vacuum <minProb> <phrasetable.gz> <new-phrasetable.gz> minProb - minimum paraphrase probability. Discard anything with lower probability. (0.01 recommended) phrasetable.gz - input paraphrase table new-phrasetable.gz - output (filtered) paraphrase table