Generate most likely substitutes for words in a given text based on an n-gram language model.

FASTSUBS		            Copyright (c) 2012-2014 Deniz Yuret

Usage: fastsubs [-n <n> | -p <p>] model.lm[.gz] < input.txt

Fastsubs is a program that finds the most likely substitutes for the
words in input.txt using the language model in model.lm (ARPA
format).  The number of substitutes is controlled by two parameters:
-n specifies the number of substitutes for each word, and -p
specifies a threshold on the sum of the substitute probabilities.
The program stops generating substitutes for a word as soon as either
limit is reached.  Each output line contains an input word followed
by its substitutes and the base-10 logarithms of their unnormalized
probabilities, in the following format:

word <tab> sub1 <space> lprob1 <tab> sub2 <space> lprob2 <tab> ...
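
For example, the following C sketch (illustrative only, not part of
the distribution; buffer sizes and variable names are made up) parses
one such output line into the word and its substitute/log-probability
pairs, assuming the tab- and space-delimited layout shown above:

    #include <stdio.h>
    #include <string.h>

    /* Illustrative only: parse one fastsubs output line of the form
       word <tab> sub1 <space> lprob1 <tab> sub2 <space> lprob2 ... */
    int main(void) {
      char line[1 << 16];
      while (fgets(line, sizeof(line), stdin)) {
        line[strcspn(line, "\n")] = '\0';
        char *field = strtok(line, "\t");  /* first field: the input word */
        if (field == NULL) continue;
        printf("word: %s\n", field);
        /* each remaining tab-separated field is a "substitute lprob" pair */
        while ((field = strtok(NULL, "\t")) != NULL) {
          char sub[1024];
          double lprob;
          if (sscanf(field, "%1023s %lf", sub, &lprob) == 2)
            printf("  %s  log10(p) = %g\n", sub, lprob);
        }
      }
      return 0;
    }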

Please see the file LICENSE for terms of use.  A paper describing the
algorithm and other related resources can be found at
http://goo.gl/jzKH0.  The code is mostly C99 compliant and has been
compiled and tested on a Linux system with gcc.  The few GNU
extensions used are optional and can be turned off by editing the
beginning of dlib.h.

Update of Jan 9, 2014: The latest version gets rid of all glib
dependencies (glib is broken: it blows up your code without warning
if your arrays or hashes get too big).  It also fixes a bug: previous
versions of the code and the paper assumed that the log back-off
weights were upper bounded by 0, which they are not.  In my standard
test of generating the top 100 substitutes for all the 1,222,974
positions in the PTB, this caused a total of 66 low-probability
substitutes to be missed and 31 to be listed out of order.  Finally,
a multithreaded version, fastsubs-omp, has been added.  The number of
threads can be controlled using the environment variable
OMP_NUM_THREADS.
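
The sketch below (illustrative, not taken from the fastsubs-omp
sources) shows how an OpenMP program picks up OMP_NUM_THREADS at run
time and how a cap such as the -m option described below could be
applied; compile with gcc -fopenmp:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
      /* omp_get_max_threads() reflects OMP_NUM_THREADS if it is set,
         otherwise the number of available cores. */
      int nthreads = omp_get_max_threads();
      int cap = 4;             /* made-up stand-in for a -m style limit */
      if (nthreads > cap) nthreads = cap;
      omp_set_num_threads(nthreads);

      #pragma omp parallel
      {
        /* in fastsubs-omp each thread would handle its own share of
           the input; here we just report the thread ids */
        printf("thread %d of %d\n", omp_get_thread_num(),
               omp_get_num_threads());
      }
      return 0;
    }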

fastsubs-omp can be used with a few additional run-time options:

Usage: fastsubs-omp [-n <n> | -p <p> | -m <max-threads> | -t | -z ] model.lm[.gz] < input.txt

-m specifies the maximum number of threads that can be used.
-z specifies that the output probability values are normalized
probabilities that sum to 1, instead of the default unnormalized log
probabilities (see the sketch below for the conversion).
-t specifies a one-target-per-line input format, instead of the
default free-text format.  In this format every input line is:

target_name <tab> target_id <tab> target_index <tab> text_line

For every such input line fastsubs-omp then outputs two lines:
1) the input line itself
2) the substitutes for text_line[target_index], i.e. the word at
position target_index in text_line, in the format:

sub1 <space> lprob1 <tab> sub2 <space> lprob2 <tab> ...
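
As a rough illustration of the conversion behind -z (and of what
normalize-subs.pl below does), the following C sketch, which is not
part of the fastsubs sources, turns a list of unnormalized log10
probabilities into probabilities that sum to 1.  It subtracts the
maximum log value before exponentiating to avoid underflow; as noted
for normalize-subs.pl below, the result is only exact when the
substitute list covers the whole vocabulary.

    #include <math.h>
    #include <stdio.h>

    /* Illustrative only: convert unnormalized log10 probabilities
       lp[0..n-1] into normalized probabilities p[0..n-1] summing to
       1.  Compile with -lm. */
    void normalize_log10(const double *lp, double *p, int n) {
      double max = lp[0], sum = 0.0;
      for (int i = 1; i < n; i++) if (lp[i] > max) max = lp[i];
      for (int i = 0; i < n; i++) { p[i] = pow(10.0, lp[i] - max); sum += p[i]; }
      for (int i = 0; i < n; i++) p[i] /= sum;
    }

    int main(void) {
      double lp[] = { -1.2, -2.5, -3.0 };   /* made-up log10 values */
      double p[3];
      normalize_log10(lp, p, 3);
      for (int i = 0; i < 3; i++) printf("p[%d] = %.4f\n", i, p[i]);
      return 0;
    }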


Other test and utility executables that can be built using the
Makefile:

* fastsubs-omp: Multithreaded version.

* wordsub: Takes the output of fastsubs and samples random
  substitutes for each word (a sampling sketch is given after this
  list).

* normalize-subs.pl: Takes fastsubs output and converts the
  unnormalized log10(p) entries to normalized p entries that add up
  to 1.0 (which may not be accurate if fastsubs did not output the
  whole vocabulary).

* fastsubs-test: Takes a model file and interactively runs fastsubs on
  sentences typed by the user.

* lmheap-test: Takes a model file and tests the heaps in the lm data
  structure by interactively reading n-grams from the user and
  printing out the contents of the heap for each position.

* sentence-test: Takes a model file and tests the sentence and lm data
  structures by reading sentences from the user and printing out logp
  for each word.

* lm-test: Takes a model file and tests the lm data structure by
  reading n-grams from the user and printing out their logp and
  back-off weights.
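
As a rough sketch of the kind of sampling wordsub performs (the
actual implementation may differ), the following C function, not
taken from the wordsub sources, draws one substitute index with
probability proportional to 10^lprob, reusing the same max-shifted
exponentiation as the normalization sketch above:

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative only: draw one index in [0, n) with probability
       proportional to 10^lp[i], given unnormalized log10 values. */
    int sample_substitute(const double *lp, int n) {
      double max = lp[0], sum = 0.0;
      for (int i = 1; i < n; i++) if (lp[i] > max) max = lp[i];
      for (int i = 0; i < n; i++) sum += pow(10.0, lp[i] - max);
      double r = ((double)rand() / RAND_MAX) * sum;
      double acc = 0.0;
      for (int i = 0; i < n; i++) {
        acc += pow(10.0, lp[i] - max);
        if (r <= acc) return i;
      }
      return n - 1;                          /* guard against rounding */
    }

    int main(void) {
      double lp[] = { -1.2, -2.5, -3.0 };    /* made-up log10 values */
      srand(42);
      printf("sampled substitute index: %d\n", sample_substitute(lp, 3));
      return 0;
    }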