As of September 1, this repo contains Python experiments aimed at converting the relatively unstructured ingredient
text records found in recipe objects produced by recipe-scrapers into structured objects.
These files contain commented-out usage examples in their if __name__ == "__main__" blocks:
- wordnet-exam.py
- walk_hn.py
There are not many guardrails here; this is a pile of experimental code aimed at understanding some of the ingredient texts. More work remains on parsing number-like strings, handling adjectives, stemming, etc. The main result so far is identifying the principal nouns of an ingredient specification.
Upcoming work:
- 1.1 extract and convert numeric quantities into real numbers
- 1.2 identify units and measures
- 1.3 pair those quantities and measures
- 2.1 ignore or remove recipes containing trademarked product ingredients (because they're often unlinkable to word nouns/phrases from similar recipes that do not contain that product)
- 3.1 combine quantities+measures+food nouns to see whether they remain coherent as ingredient specifications
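Steps 1.1 to 1.3 above could be sketched roughly as follows. This is only an illustration of the idea, not code from the repo: the UNITS table, the regular expression, and the function names parse_quantity and split_ingredient are all assumptions made for the example, and a real unit table would be much larger.

```python
import re
from fractions import Fraction

# Hypothetical unit table mapping surface forms to a canonical unit name.
UNITS = {
    "cup": "cup", "cups": "cup",
    "tablespoon": "tablespoon", "tablespoons": "tablespoon", "tbsp": "tablespoon",
    "teaspoon": "teaspoon", "teaspoons": "teaspoon", "tsp": "teaspoon",
    "ounce": "ounce", "ounces": "ounce", "oz": "ounce",
    "pound": "pound", "pounds": "pound", "lb": "pound",
}

# Matches "1 1/2", "3/4", "2.5" or "2" at the start of a line.
QUANTITY_RE = re.compile(r"^\s*(\d+\s+\d+/\d+|\d+/\d+|\d+(?:\.\d+)?)")


def parse_quantity(text):
    """Convert a number-like string such as '1 1/2' to a float (step 1.1)."""
    return float(sum(Fraction(part) for part in text.split()))


def split_ingredient(line):
    """Return (quantity, unit, remainder); quantity and unit are None if absent."""
    match = QUANTITY_RE.match(line)
    if not match:
        return None, None, line.strip()
    quantity = parse_quantity(match.group(1))
    rest = line[match.end():].strip()
    # Step 1.2: treat the first word after the number as a unit candidate.
    first, _, remainder = rest.partition(" ")
    unit = UNITS.get(first.lower().rstrip("."))
    if unit is None:
        return quantity, None, rest
    # Step 1.3: the quantity and unit are returned as a pair.
    return quantity, unit, remainder.strip()
```

For example, split_ingredient("1 1/2 cups flour") gives (1.5, "cup", "flour"), while a line with no leading number comes back unchanged with None for both quantity and unit.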
What's done in recipe-normalize:
-
common.py:
- IsAFood: True if the word has one of several food-related hypernyms
- IsAWord: True if the word exists in WordNet
- gen_ingr_words(lines): generator that yields word-like strings
- gen_ingr_lines(filename): generator that yields lines from a file, HTML-stripped
- ngrammer(str/list): generates n-grams from a str or a list of str
- DocGramme: an attempt at a tree of n-gram length vs. frequency
- NGramTree/NGramTreeNode: like DocGramme, and unfinished
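As a rough illustration of the n-gram idea behind ngrammer and DocGramme, here is a minimal sketch. It is not the code in common.py: the names ngrams and ngram_counts_by_length, and the choice to split strings on whitespace, are assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield n-token tuples from a str (split on whitespace) or a list of str."""
    if isinstance(tokens, str):
        tokens = tokens.split()
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def ngram_counts_by_length(lines, max_n=3):
    """Map each n-gram length (1..max_n) to a Counter of n-gram frequencies."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        for n in counts:
            counts[n].update(ngrams(line, n))
    return counts
```

Grouping the counters by length keeps the "length vs. frequency" view that DocGramme is described as attempting, without committing to a tree structure.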
-
syntree.py: a CLI utility that displays the hypernyms of a word
-
wordnet-exam.py:
- brandname_lexicon: builds a lexicon of words containing or ending with a registered (R) or trademark (TM) mark, excluding words that are also generic food nouns
- flatten(toflatten): flattens nested lists into one flat generator
- hypernym_collector: counts common ancestors across all ingredient words
- pos_collector: converts ingredient lines to part-of-speech (POS) templates, to see how many lines collapse into the same template
- count_ngrams: counts occurrences of ingredient n-grams
- inverted_hypernym_tree: collects ingredient word hypernym trees and inverts them to a shared root
- word_food_histogram: categorizes ingredient words as [food, word, unknown], writes them to files, and generates a histogram and counts
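The brand-name lexicon idea might look something like the sketch below. It is a guess at the approach, not the code in wordnet-exam.py: the regular expression, the function name brand_lexicon_sketch, and the generic_food_nouns parameter (standing in for whatever food-noun check the repo uses) are all assumptions.

```python
import re

# A word immediately followed by a registered/trademark mark,
# e.g. "Bisquick®" or "Jell-O(TM)".
BRAND_RE = re.compile(r"([A-Za-z][\w'-]*)\s?(?:®|™|\(R\)|\(TM\))")


def brand_lexicon_sketch(lines, generic_food_nouns=frozenset()):
    """Collect marked words, skipping those that are also generic food nouns."""
    lexicon = set()
    for line in lines:
        for word in BRAND_RE.findall(line):
            if word.lower() not in generic_food_nouns:
                lexicon.add(word)
    return lexicon
```

Here the generic food nouns are passed in as a plain set so the sketch stays self-contained; in the repo that role would presumably be played by a WordNet-based check such as IsAFood.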
-
walk_hn.py:
- convert_ancestry_to_d3_hierarchy: transforms the output of word_ancestry_finder into a d3.hierarchy-compatible data structure (with 'value' and 'children' entries)
- word_ancestry_finder: produces a dict tree of word-to-hypernym ancestry paths with 'value' int counts; uses word_tree + hn_visit to recurse the hypernym tree
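A conversion of this kind could be sketched as below. The input shape is an assumption, since the exact structure returned by word_ancestry_finder is not described here: each node is taken to be a dict holding a 'value' count plus one key per child hypernym. The function name to_d3_hierarchy and the 'name' key are also illustrative choices.

```python
def to_d3_hierarchy(name, node):
    """Convert an assumed {'value': int, child_name: {...}} dict tree into
    nested {'name', 'value', 'children'} dicts."""
    children = [
        to_d3_hierarchy(child_name, child_node)
        for child_name, child_node in node.items()
        if child_name != "value"
    ]
    result = {"name": name, "value": node.get("value", 0)}
    if children:  # leaves carry no 'children' entry
        result["children"] = children
    return result
```

The result can be serialized with json.dumps and handed to d3.hierarchy, whose default children accessor reads the 'children' array of each node.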
-
hier.html: D3.js page that displays the hypernym hierarchy produced by convert_ancestry_to_d3_hierarchy
-
gitta_clean.py: replaces numeric-ish strings with "quant" in an attempt to run gitta with fewer permutations
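The replacement step might be approximated with a single regular expression, as in this sketch. It is not the code in gitta_clean.py: the pattern, the function name replace_quantities, and the decision to treat mixed numbers and Unicode fractions as one token are assumptions for illustration.

```python
import re

UNICODE_FRACTIONS = "¼½¾⅓⅔⅛⅜⅝⅞"

# Integers, decimals, simple fractions, mixed numbers ("1 1/2", "1½"),
# and bare Unicode fractions.
NUMERIC_RE = re.compile(
    r"\d+(?:[./]\d+)?(?:\s*(?:\d+/\d+|[" + UNICODE_FRACTIONS + r"]))?"
    r"|[" + UNICODE_FRACTIONS + r"]"
)


def replace_quantities(line, token="quant"):
    """Replace every numeric-ish substring in line with a placeholder token."""
    return NUMERIC_RE.sub(token, line)
```

Collapsing "1 1/2", "2.5" and "½" to the same token means lines that differ only in their amounts become identical, which is what reduces the number of distinct patterns a grammar-induction run has to consider.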