/ingredient-phrase-tagger

Extract structured data from ingredient phrases using conditional random fields

Primary LanguagePythonOtherNOASSERTION

CRF Ingredient Phrase Tagger

This repo contains scripts to extract the Quantity, Unit, Name, and Comments from unstructured ingredient phrases. We use it on Cooking to format incoming recipes. Given the following input:

1 pound carrots, young ones if possible
Kosher salt, to taste
2 tablespoons sherry vinegar
2 tablespoons honey
2 tablespoons extra-virgin olive oil
1 medium-size shallot, peeled and finely diced
1/2 teaspoon fresh thyme leaves, finely chopped
Black pepper, to taste

Our tool produces something like:

{
    "qty":     "1",
    "unit":    "pound"
    "name":    "carrots",
    "other":   ",",
    "comment": "young ones if possible",
    "input":   "1 pound carrots, young ones if possible",
    "display": "<span class='qty'>1</span><span class='unit'>pound</span><span class='name'>carrots</span><span class='other'>,</span><span class='comment'>young ones if possible</span>",
}

We use a conditional random field model (CRF) to extract tags from labelled training data, which was tagged by human news assistants. We wrote about our approach on the New York Times Open blog. More information about CRFs can be found here.

On a 2012 Macbook Pro, training the model takes roughly 30 minutes for 130k examples using the CRF++ library.

Development

On OSX:

brew install crf++
python setup.py install

Quick Start

The most common usage is to train the model with a subset of our data, test the model against a different subset, then visualize the results. We provide a shell script to do this, at:

./roundtrip.sh

You can edit this script to specify the size of your training and testing set. The default is 20k training examples and 2k test examples.

Usage

Training

To train the model, we must first convert our input data into a format which crf_learn can accept:

bin/generate_data --data-path=input.csv --count=1000 --offset=0 > tmp/train_file

The count argument specifies the number of training examples (i.e. ingredient lines) to read, and offset specifies which line to start with. There are roughly 180k examples in our snapshot of the New York Times cooking database (which we include in this repo), so it is useful to run against a subset.

The output of this step looks something like:

1            I1      L8      NoCAP  NoPAREN  B-QTY
cup          I2      L8      NoCAP  NoPAREN  B-UNIT
white        I3      L8      NoCAP  NoPAREN  B-NAME
wine         I4      L8      NoCAP  NoPAREN  I-NAME

1/2          I1      L4      NoCAP  NoPAREN  B-QTY
cup          I2      L4      NoCAP  NoPAREN  B-UNIT
sugar        I3      L4      NoCAP  NoPAREN  B-NAME

2            I1      L8      NoCAP  NoPAREN  B-QTY
tablespoons  I2      L8      NoCAP  NoPAREN  B-UNIT
dry          I3      L8      NoCAP  NoPAREN  B-NAME
white        I4      L8      NoCAP  NoPAREN  I-NAME
wine         I5      L8      NoCAP  NoPAREN  I-NAME

Next, we pass this file to crf_learn, to generate a model file:

crf_learn template_file tmp/train_file tmp/model_file

Testing

To use the model to tag your own arbitrary ingredient lines (stored here in input.txt), you must first convert it into the CRF++ format, then run against the model file which we generated above. We provide another helper script to do this:

python bin/parse-ingredients.py input.txt > results.txt

The output is also in CRF++ format, which isn't terribly helpful to us. To convert it into JSON:

python bin/convert-to-json.py results.txt > results.json

See the top of this README for an example of the expected output.

Authors

License

Apache 2.0.