poligrounder: A Scala repository from benwing

Intermediate-file format:

We use an intermediate-file format, described below.  Corpora in an arbitrary
source format are converted directly into this format.  The intermediate
format consists of two types of files.  The first type, the data file, holds
all the data for each document, both the text and the metadata, as a number
of fields separated by tab characters.  There may be several data files if
there is a large amount of data.  The second type is a very small "schema"
file, which describes the fields in the data files.

The data files contain one document per line, for ease in processing
with Hadoop, even though that may produce very large lines.
The text in these files should be pre-split into words, since the algorithm
for word-splitting will in general be specific to the corpus.  The text
should be in raw UTF-8 format, listed as a series of words separated by
spaces.  Because of the pre-splitting, there should be no words containing
spaces, tabs or newlines, and hence no need to escape such characters to
prevent them from being interpreted as separators.
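
As an illustration of the text-field convention above, here is a minimal
Scala sketch.  The object name TextField and its methods are hypothetical,
not part of the repository, and rejecting empty words is an added assumption
(an empty word would produce a doubled space).

```scala
object TextField {
  // Join pre-split words into the space-separated text field.
  // The format has no escaping for separators, so words containing a
  // space, tab or newline are rejected.  Rejecting empty words is an
  // assumption made here, not something the format description states.
  def encode(words: Seq[String]): String = {
    words.foreach { w =>
      require(
        w.nonEmpty && !w.exists(c => c == ' ' || c == '\t' || c == '\n' || c == '\r'),
        s"word is empty or contains a separator character: '$w'")
    }
    words.mkString(" ")
  }

  // Recover the words from a text field.
  def decode(field: String): Seq[String] =
    if (field.isEmpty) Seq.empty else field.split(' ').toSeq
}
```

For example, TextField.encode(Seq("Hello", ",", "world")) yields the field
"Hello , world", and TextField.decode reverses it.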

The data file contains a number of "fixed" fields that are common across all
corpora and have well-defined semantics, as well as some "variable" fields
that are specific to each corpus.  The order of the fields is not
significant, since the schema file indicates which field occurs where; however,
it is convenient for viewing purposes to place the field containing text
last, and the other fixed fields first.  The defined fixed fields are:

1. "corpus": Corpus.  This indicates the source of the documents, e.g.
   "wikipedia-english-20111007" or "twitter-infochimps".
2. "title": Document title.  This should be unique within a given corpus, but
   need not be unique across all corpora.  The combination of corpus and title
   uniquely identifies a document.  If there is only a unique ID, and no title,
   the ID can be copied into the title field.  If the title is not necessarily
   unique, the ID can be prepended.  Examples of cases where both titles and
   IDs exist are Wikipedia articles and Twitter users.
3. "id": Document ID.  Generally, this should also be unique within a given
   corpus.  This field is only used for identification currently.  If there
   isn't an ID separate from the title, either the title can be copied or an
   arbitrary number generated.
4. "group": Document group.  This is used for grouping documents into larger
   documents (e.g. grouping tweets by user).  This can be left blank in cases
   where no such concept exists.
5. "coord": Document coordinates.  This is a combination of two floating-point
   values, specifying latitude and longitude respectively (where positive
   indicates north and east, negative south and west), and separated by a
   comma.
6. "split": Document split.  This should be one of "training", "dev" or "test".
7. "text": Text, pre-split, separated by spaces, in UTF-8 format.
8. Additional ("variable") fields, relevant to specific types of corpora.
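
The "coord" and "split" fields can be checked with a few lines of Scala.
This is only a sketch: the names CoordField, Coord and validSplits are
hypothetical, and two behaviors are assumptions rather than part of the
format description, namely treating a blank or malformed field as "no
coordinates" and rejecting values outside the usual latitude and longitude
ranges.

```scala
object CoordField {
  final case class Coord(lat: Double, lon: Double)

  // Parse "LAT,LONG" (positive = north/east, negative = south/west).
  // Assumption: blank, malformed or out-of-range input gives None.
  def parse(field: String): Option[Coord] =
    field.split(",", -1) match {
      case Array(latStr, lonStr) =>
        for {
          lat <- scala.util.Try(latStr.trim.toDouble).toOption
          lon <- scala.util.Try(lonStr.trim.toDouble).toOption
          if lat >= -90 && lat <= 90 && lon >= -180 && lon <= 180
        } yield Coord(lat, lon)
      case _ => None
    }

  // The three values allowed in the "split" field.
  val validSplits: Set[String] = Set("training", "dev", "test")
}
```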

The schema file, which describes the fields in the data file, is currently
simply a one-line file with each field name separated by a tab.  Fixed fields
need to be listed as well.
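
Reading a data line against the schema might look like the following
Scala sketch.  SchemaReader, parseSchema and parseLine are hypothetical
names, and requiring the field count to match the schema is an assumption.

```scala
object SchemaReader {
  // Field names from the one-line, tab-separated schema file.
  def parseSchema(schemaLine: String): Vector[String] =
    schemaLine.split("\t", -1).toVector

  // Pair each tab-separated value of a data line with its field name.
  // The limit of -1 keeps trailing blank fields (e.g. an empty "group").
  def parseLine(schema: Vector[String], line: String): Map[String, String] = {
    val values = line.split("\t", -1).toVector
    require(values.length == schema.length,
      s"expected ${schema.length} fields, found ${values.length}")
    schema.zip(values).toMap
  }
}
```

With a schema line of "corpus", "title", "id", "group", "coord", "split"
and "text" joined by tabs, parseLine returns a map from which, say, the
"text" value can be looked up regardless of the column order.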

   We may add a second line specifying the type of each field.
   Possible field types might be string, integer, floating-point or boolean.
   (Booleans would be serialized as "yes"/"no", "true"/"false" or "on"/"off".
   On input we accept any of them.  On output, we either need to pick one
   or have different boolean types, e.g. "yes-boolean", "on-boolean" or
   "true-boolean".)  Either the types should be automatically nullable
   (i.e. they have an additional "unknown" value corresponding to a blank
   field), or there should be nullable versions of each type.
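
To make the nullable-boolean idea concrete, here is a hedged Scala sketch
of one possible reading.  Nothing here exists in the repository: the
BooleanField object is hypothetical, case-insensitive matching is an
assumption, and writing "yes"/"no" on output is an arbitrary pick among
the three spellings.

```scala
object BooleanField {
  // Accept any of the three spellings on input; a blank field maps to
  // None, standing for the "unknown" value of a nullable type.
  def parse(field: String): Option[Boolean] =
    field.trim.toLowerCase match {
      case ""                     => None
      case "yes" | "true" | "on"  => Some(true)
      case "no" | "false" | "off" => Some(false)
      case other =>
        throw new IllegalArgumentException(s"not a boolean: '$other'")
    }

  // Output has to commit to one spelling; "yes"/"no" is chosen arbitrarily.
  def format(value: Option[Boolean]): String = value match {
    case Some(true)  => "yes"
    case Some(false) => "no"
    case None        => ""
  }
}
```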

Counts-file format:

The data file in the above format can be processed to produce a file with
word counts instead of actual text.  The text field is replaced by a "counts"
field.  The counts field contains a series of entries separated by spaces.
Each entry contains one or more fields, separated by colons.  Because the
colon serves as a separator, it needs to be escaped when it occurs as a
character in a word (e.g. in a URL or emoticon occurring as a word).  To
make it easy to split accurately, the colon needs to be escaped into a
format that does not have a colon character in it (i.e. simply preceding it
with a backslash won't work).  Instead, we URL-encode colons and percent signs,
i.e. colon becomes %3A and percent sign becomes %25.
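
The escaping rule can be sketched in Scala as follows.  CountsEscape is a
hypothetical name; the point of the sketch is the order of the two
replacements, which matters for a correct round trip.

```scala
object CountsEscape {
  // Encode percent signs first, so that the '%' introduced when a colon
  // becomes "%3A" is not itself turned into "%25".
  def encode(word: String): String =
    word.replace("%", "%25").replace(":", "%3A")

  // Decode in the reverse order: colons first, percent signs last.
  def decode(encoded: String): String =
    encoded.replace("%3A", ":").replace("%25", "%")
}
```

Under this scheme a word such as "http://a.b/c%20d" is stored as
"http%3A//a.b/c%2520d", which contains no colon and decodes back to the
original.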

Unigram files have entries of the form WORD:COUNT.  Bigram files will have
entries either of the form WORD1:WORD2:COUNT or WORD:COUNT (the latter
containing unigram counts, for backoff purposes).  If additional parameters
need to be specified, they should be given in a format such as
:PARAM:VALUE, with an initial colon.
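
A reader for these entries could be sketched in Scala as below.  All names
(CountsField, NGram, Param) are hypothetical, counts are assumed to be
integers, and the parameter name "lambda" used in the note that follows is
invented for illustration only.

```scala
object CountsField {
  sealed trait Entry
  final case class NGram(words: Vector[String], count: Int) extends Entry
  final case class Param(name: String, value: String) extends Entry

  // Undo the URL-encoding of colons and percent signs.
  private def decode(s: String): String =
    s.replace("%3A", ":").replace("%25", "%")

  def parseEntry(entry: String): Entry =
    entry.split(":", -1).toVector match {
      // A leading colon yields an empty first piece: ":PARAM:VALUE".
      case Vector("", name, value) => Param(decode(name), decode(value))
      // One word (unigram) or two words (bigram), followed by the count.
      case pieces if pieces.length == 2 || pieces.length == 3 =>
        NGram(pieces.init.map(decode), pieces.last.toInt)
      case _ =>
        throw new IllegalArgumentException(s"malformed entry: '$entry'")
    }

  // Split the counts field on spaces and parse each entry.
  def parse(field: String): Vector[Entry] =
    field.split(" ").toVector.filter(_.nonEmpty).map(parseEntry)
}
```

Here "new:york:3" parses as a bigram with count 3, "the:12" as a unigram,
and ":lambda:0.5" as a parameter entry.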