Fuzzy and quantitative ClueDB lookup

Question

Fuzzy and quantitative ClueDB lookup

Closed this issue 8 years ago · 3 comments

The ClueDB currently provides an API for getting potential answers of a given length for an exact clue match. The length is a hard constraint, but matching clues should not be.

A more general version would be useful that allows fuzzy / partial matches of some kind, and returns an estimated confidence/likelihood for each potential answer (based on the number of times that answer has occurred, the number of possible choices, the closeness of the clue match). Ideally this more general ClueDB would have a tuning parameter that controls the strictness of "matches", with one extreme being "exact matches only" (the current behavior) and the other extreme being "all words of the correct length".

Brute-force lookup for clues within a certain Levenshtein distance might work. This paper may or may not have some other useful ideas:

Abernathy, C., & Bao, L. (2012). CRIS: Natural Language Processing of Crossword Clues, 1–8. (https://github.com/lisabao/cris)

Answer 1 · 2016-02-27T20:44:40.000Z

I might take this on next once I finish serialization - since I just wrote ClueDBSolver (and it works well for the exact-match case) a good next step would be to add near-matches.

Find nearly-matching answers and evaluate how closely they match
Use nearly-matching answers to solve crosswords, ignoring how well they matched and treating them all as perfect matches (this should happen almost automatically)
Take into account the goodness of each possible answer when constructing crossword solutions

Answer 2 · 2016-03-09T01:50:49.000Z

Here is one library for fuzzy matching: https://pythonhosted.org/ngram/ngram.html

but it uses an extra ~3 GB of memory for our ClueDB.

Others:

Answer 3 · 2016-03-16T23:17:24.000Z

Closing for now. The fuzzy N-gram lookup is slow and needs improvement, but it's done once at the beginning so it's workable.

Another important follow-up will be to investigate whether the fuzzy lookup is better or worse than exact match at producing good solutions.