Please refer to our paper "Paraphrasing for Style" at COLING 2012 for more details, and please cite it accordingly:

@inproceedings{xu2012paraphrasing,
  title     = {Paraphrasing for Style},
  author    = {Xu, Wei and Ritter, Alan and Dolan, Bill and Grishman, Ralph and Cherry, Colin},
  booktitle = {COLING},
  pages     = {2899--2914},
  year      = {2012}
}

============================================================================================

python/crawl_{plays|sonnets}.py
    Crawls the nfs.sparknotes.com website and downloads the HTML pages.

python/scrapper_soupparser.py
    Extracts original and modern English sentences from the HTML
    (see the extraction sketch at the end of this section).

scripts/extract_lines.sh
    Runs scrapper_soupparser.py on all the HTML files and puts the output
    into a format suitable for sentence alignment (in data/plays/align).

scrapper2/
    Scripts and data for scraping parallel texts from www.enotes.com.

data/align/plays/    from nfs.sparknotes.com
data/align/plays2/   from www.enotes.com

data/align/plays/notmerged
data/align/plays2/notmerged
    Aligned sentences at the level of individual HTML pages.

data/align/plays/merged
data/align/plays2/merged
    All sentences merged together into whole plays. The bilingual sentence
    aligner doesn't really work on very small files (e.g. the individual
    HTML pages). Presumably one could do a better job with the small files,
    since they are more tightly aligned, but this is probably not the kind
    of data the aligner was designed for. It works fine on the merged plays:
    out of 31,718 original lines, the aligner finds 21,079 alignments. It
    may be possible to improve on this. Also note that in the merged plays
    the pages are in random order; this probably makes little difference,
    but might be worth fixing.

data/align/plays/model_16plays
    Moses model trained on 16 plays (all except Romeo & Juliet), following
    the instructions at http://www.statmt.org/moses_steps.html

    moses.ini
    moses-bin.ini
        The two configs above use the same parameters but produce different
        outputs (?).

    tuning/
        Romeo & Juliet is held out here, to be split into dev and test sets.

data/test_small
    Wei_facebook.txt
        A small set of Facebook posts collected by hand from my Facebook.
    03_natural_tweets.pl
    natural_tweets_10000.txt
        An attempt to find 'meaningful' tweets with heuristic rules, e.g.
        the proportion of words found in a dictionary (see the filter sketch
        at the end of this section).

data/shakespere.dict
    A Shakespeare glossary from http://www.william-shakespeare.info/
    Scripts and HTML pages used:
        dictionary/*.htm
        python/crawl_dictionary.py
        python/scrap_dictionary.py

mert/
    Held-out dev data from Romeo & Juliet for discriminative training.
    mert/mert-work/moses.ini contains the tuned weights for the 16-plays
    model.

eval/
    Files for evaluating BLEU.
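For reference, here is a minimal sketch of the kind of extraction scrapper_soupparser.py performs. The two-column table layout assumed below is a guess at the nfs.sparknotes.com page structure, not taken from the actual script:

import sys
from bs4 import BeautifulSoup

def extract_pairs(html):
    """Return (original, modern) sentence pairs from one downloaded page."""
    soup = BeautifulSoup(html, 'html.parser')
    pairs = []
    for row in soup.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) == 2:  # assumed: original text left, modern English right
            original = cells[0].get_text(' ', strip=True)
            modern = cells[1].get_text(' ', strip=True)
            if original and modern:
                pairs.append((original, modern))
    return pairs

if __name__ == '__main__':
    for o, m in extract_pairs(open(sys.argv[1]).read()):
        print('%s\t%s' % (o, m))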
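And a rough Python sketch of the dictionary-proportion heuristic behind 03_natural_tweets.pl (the real script is Perl; the 0.8 threshold and the tokenization below are invented for illustration):

import re

def looks_natural(tweet, dictionary, min_in_dict=0.8):
    """Keep tweets in which most tokens are ordinary dictionary words."""
    words = re.findall(r"[a-z']+", tweet.lower())
    if not words:
        return False
    in_dict = sum(1 for w in words if w in dictionary)
    return in_dict / float(len(words)) >= min_in_dict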
Below are some results from the 16-plays model, with and without MERT.

First, note that the second modern translation (enotes.com) is much closer to the original text than the first (sparknotes), as scoring the modern text directly against the original shows:

-bash-4.1$ ~/mt/mosesdecoder/scripts/generic/multi-bleu.perl ascii.romeojuliet_tokenized_lower_original < ascii.romeojuliet_tokenized_lower_modern.1
BLEU = 24.67, 56.5/30.2/18.8/11.5 (BP=1.000, ratio=1.047, hyp_len=4312, ref_len=4120)

-bash-4.1$ ~/mt/mosesdecoder/scripts/generic/multi-bleu.perl ascii.romeojuliet_tokenized_lower_original < ascii.romeojuliet_tokenized_lower_modern.2
BLEU = 52.30, 75.9/57.7/46.0/37.1 (BP=1.000, ratio=1.044, hyp_len=4300, ref_len=4120)

Current BLEU scores (the larger LM has a big effect):

16and7plays_16LM.1        BLEU = 27.35, 59.8/33.3/21.0/13.4 (BP=1.000, ratio=1.041, hyp_len=4287, ref_len=4120)
16and7plays_16LM_mert.1   BLEU = 27.61, 59.9/33.8/21.4/13.5 (BP=1.000, ratio=1.039, hyp_len=4281, ref_len=4120)
16and7plays_16LM.2        BLEU = 52.68, 77.4/58.6/46.2/36.7 (BP=1.000, ratio=1.041, hyp_len=4289, ref_len=4120)
16and7plays_16LM_mert.2   BLEU = 53.27, 77.5/59.3/46.9/37.4 (BP=1.000, ratio=1.045, hyp_len=4306, ref_len=4120)
16and7plays_37LM.1        BLEU = 30.56, 61.1/36.1/24.2/16.3 (BP=1.000, ratio=1.044, hyp_len=4300, ref_len=4120)
16and7plays_37LM_mert.1   BLEU = 30.54, 60.7/36.0/24.3/16.4 (BP=1.000, ratio=1.048, hyp_len=4319, ref_len=4120)
16and7plays_37LM.2        BLEU = 57.80, 79.3/63.0/52.0/42.9 (BP=1.000, ratio=1.045, hyp_len=4305, ref_len=4120)
16and7plays_37LM_mert.2   BLEU = 57.32, 78.9/62.3/51.5/42.7 (BP=1.000, ratio=1.052, hyp_len=4335, ref_len=4120)
16plays_37LM.1            BLEU = 30.63, 61.1/36.1/24.2/16.5 (BP=1.000, ratio=1.032, hyp_len=4250, ref_len=4120)
16plays_37LM_mert.1       BLEU = 30.53, 60.8/35.9/24.2/16.4 (BP=1.000, ratio=1.030, hyp_len=4244, ref_len=4120)
16plays_37LM.2            BLEU = 55.54, 78.0/61.1/49.6/40.2 (BP=1.000, ratio=1.036, hyp_len=4267, ref_len=4120)
16plays_37LM_mert.2       BLEU = 56.02, 78.0/60.9/50.1/41.4 (BP=1.000, ratio=1.040, hyp_len=4285, ref_len=4120)
singleword1.1             BLEU = 24.39, 56.4/29.8/18.6/11.3 (BP=1.000, ratio=1.047, hyp_len=4312, ref_len=4120)
singleword1.2             BLEU = 51.51, 75.8/56.9/45.1/36.2 (BP=1.000, ratio=1.044, hyp_len=4300, ref_len=4120)

./models/Lexicons/singleword1/
    A very simple (correctness not fully verified) dictionary-based model,
    which can translate "No way excuse his disadvantages when we do bear"
    into "No way excuse his foils when we do bear". The phrase table P(o|m)
    is based on a verbatim dictionary (1,520 pairs) and the frequencies of
    the 'original' words in the 37 Shakespeare plays. For example, the
    dictionary has two entries for the Shakespearean word 'abrook':
    'abrook -> abide' and 'abrook -> brook'.

    - Extend the dictionary to a 'modern -> original' phrase table by taking
      the reverse direction (e.g. 'abide -> abrook'), adding identical word
      pairs (e.g. 'abide -> abide'), and adding suffixed forms (e.g.
      'abides -> abrooks', 'abideed -> abrooked', etc.).
    - Use the language model built on the 37 original plays
      (37plays_tokenized_lowercased.original.1gram) to derive P(o|m) from
      unigram conditional probabilities, e.g.:

      abide ||| abide  ||| 0.940088299386192
      abide ||| abrook ||| 0.0599117006138077
      brook ||| brook  ||| 0.95608871056976
      brook ||| abrook ||| 0.0439112894302401

    (A sketch of this construction follows below.)
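To make the scoring rule concrete, here is a sketch of the P(o|m) construction described above (the suffixing step is omitted). The function name and the example counts are made up; the counts are chosen only to land roughly on the ratios in the entries above:

from collections import defaultdict

def build_phrase_table(dictionary_pairs, unigram_counts):
    """dictionary_pairs: (original, modern) entries, e.g. ('abrook', 'abide').
    unigram_counts: word -> frequency in the 37 original plays."""
    candidates = defaultdict(set)
    for original, modern in dictionary_pairs:
        candidates[modern].add(original)  # reversed direction, e.g. abide -> abrook
        candidates[modern].add(modern)    # identical pair, e.g. abide -> abide
    table = {}
    for modern, originals in candidates.items():
        total = sum(unigram_counts.get(o, 0) for o in originals)
        if total == 0:
            continue
        for o in originals:
            # P(o|m): unigram count of o, normalized over m's candidate set
            table[(modern, o)] = unigram_counts.get(o, 0) / float(total)
    return table

# Invented counts, chosen to roughly reproduce the example entries above:
counts = {'abide': 157, 'brook': 218, 'abrook': 10}
table = build_phrase_table([('abrook', 'abide'), ('abrook', 'brook')], counts)
print(table[('abide', 'abrook')])  # ~0.0599, cf. 'abide ||| abrook ||| 0.0599...'
print(table[('brook', 'abrook')])  # ~0.0439, cf. 'brook ||| abrook ||| 0.0439...'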