Synonyms and related items
markwriter opened this issue · 8 comments
I am curious whether it would be appealing to you to add support for synonyms and/or related terms in the search. I would like the search engine to return items for "manufacturing" and "mfg" automatically. We have a number of such abbreviations in our text. Another one that comes to mind is "California" and "CA", etc. An example of related terms: the search term "sports" would yield "baseball", "football", etc.
Barring having Lifti do that natively, what would be your suggestion for implementing that outside of Lifti? The only thing that comes to mind for me is to parse the input search string, looking for synonyms and the like. I am hesitant to do that, however, because we're utilizing the Lifti query language and it seems like it might be problematic to directly parse a string looking for "mfg" and directly substitute in "(mfg | manufacturing)". Is there any better way to add new conditions to a query term? I looked at the section about manually creating queries, but then doesn't that put me in the position of writing all the parsing from scratch?
Any guidance would be greatly appreciated.
Thanks @markwriter! If you're using the LIFTI query syntax and relying on the built-in query parser, I don't think there's going to be a simple way for you to do it right now. This is probably a good idea for something that should be built in.
Maintaining a list of synonyms within LIFTI itself wouldn't work, because as your comment hints at, synonyms are likely to be domain specific, and definitely language specific. The synonym information will need to be provided as part of the index set-up, i.e. something like:
var index = new FullTextIndexBuilder<int>()
    .WithSynonyms(
        // "true synonyms"
        // "manufacturing", "mfg" and "creating" are all synonyms for each other
        new Synonyms("manufacturing", "mfg", "creating"),
        // "One-way synonyms" (word expansions?)
        // * "football" and "baseball" are both synonyms for "sport", but not for each other
        // * searching for "sport" will return entries containing "sport", "football" or "baseball"
        // * searching for "football" will only return entries containing "football"
        // * searching for "baseball" will only return entries containing "baseball"
        new OneWaySynonym("sport", "football", "baseball"))
    .Build();
Does it make sense, or is it over-complicated? I also wonder whether matching on a synonym should result in a lower score for the result, so results containing the exact word are still scored above synonym matches?
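To make the proposed semantics concrete, here is a minimal sketch in plain C# of how two-way and one-way expansion differ. It does not use LIFTI at all: `SynonymExpander`, `AddSynonyms`, `AddOneWay` and `Expand` are hypothetical names invented for illustration, and the sketch only models which terms a search word would expand to, not indexing or scoring.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, NOT part of LIFTI: models the expansion rules only.
public sealed class SynonymExpander
{
    private readonly Dictionary<string, HashSet<string>> expansions =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    // Two-way: every word in the group expands to every other word in the group.
    public SynonymExpander AddSynonyms(params string[] words)
    {
        foreach (var word in words)
        {
            Targets(word).UnionWith(words);
        }

        return this;
    }

    // One-way: only the broader term expands to the narrower terms.
    public SynonymExpander AddOneWay(string broader, params string[] narrower)
    {
        Targets(broader).UnionWith(narrower);
        return this;
    }

    // Returns the search term itself plus everything it expands to (no transitive expansion).
    public HashSet<string> Expand(string term)
    {
        var result = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { term };
        if (expansions.TryGetValue(term, out var targets))
        {
            result.UnionWith(targets);
        }

        return result;
    }

    private HashSet<string> Targets(string word)
    {
        if (!expansions.TryGetValue(word, out var set))
        {
            set = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
            expansions[word] = set;
        }

        return set;
    }
}
```

With `AddSynonyms("manufacturing", "mfg", "creating")` and `AddOneWay("sport", "football", "baseball")` registered, `Expand("mfg")` includes "manufacturing", `Expand("sport")` includes "football" and "baseball", and `Expand("football")` contains only "football".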
Actually, everything you say here is great. The "one way" synonyms make a lot of sense - great catch.
Same thing with the fact that the synonym list would be supplied by the Lifti consumer - I definitely imagined it that way, especially with all the specialized domains out there.
The thing is, although I want my users to take advantage of the query language, I bet only 2% of them will. Maybe I should just default to assuming that they will only use the default search behavior (mine is configured to "or") and then I would be pretty safe in parsing the query and inserting synonyms directly into the query.
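A rough sketch of that workaround, assuming the input really is a plain list of words: each known term is replaced with a bracketed "or" group such as "(mfg | manufacturing)", and anything that looks like it uses query operators is passed through untouched. `NaiveQueryExpander` and its `OperatorChars` list are hypothetical; the character list is illustrative only and is not a complete description of the LIFTI query syntax.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical pre-processor, NOT part of LIFTI.
public static class NaiveQueryExpander
{
    // Assumption: an illustrative, incomplete set of characters that suggest
    // the user is writing query syntax rather than plain words.
    private static readonly char[] OperatorChars =
        { '&', '|', '(', ')', '"', '~', '>', '=', '*', '?' };

    public static string Expand(string query, IReadOnlyDictionary<string, string[]> synonyms)
    {
        // Leave anything that looks like query syntax alone rather than risk corrupting it.
        if (query.IndexOfAny(OperatorChars) >= 0)
        {
            return query;
        }

        var terms = query.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Replace each known term with "(term | synonym1 | synonym2)".
        var expanded = terms.Select(term =>
            synonyms.TryGetValue(term, out var alternatives)
                ? "(" + string.Join(" | ", new[] { term }.Concat(alternatives)) + ")"
                : term);

        return string.Join(" ", expanded);
    }
}
```

Given a case-insensitive dictionary that maps "mfg" to "manufacturing", the input "mfg plants" becomes "(mfg | manufacturing) plants", while an input such as "mfg & plants" is returned unchanged. The expanded string would then be handed to the index's normal query parser.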
Additionally, we can curate our data better and remove abbreviations like "mfg" where possible.
Oh, one more thing - your last comment I find very intriguing - "I also wonder whether matching on a synonym should result in a lower score for the result"... Search result scoring can get very subtle. I wonder if taking such things (synonym matching vs. actual term matching) into consideration would make it much more complicated on your part? That kind of thing is why search engine building can be a full time endeavor. Getting that weighting to look the way you would expect is just as much art as it is science. Maybe there could be something configurable in the index builder, such as a match weight for the synonyms.
I had a simplifying thought:
A synonym would be like your first example: if the user defines "mfg" and "manufacturing" as synonyms, they would be completely equivalent.
The one-way synonym would be something that I thought of as a "related item". In the domain at my company we have "class codes", and many of them are related by a topic. The lists would have to be completely user-defined. A search term that matches any item in the list "11111, 22222, 33333, 44444, 55555" would return hits for all the other ones as well. It would be fine if they would be equally weighted with other hits. In my case, I could see us setting up the index to NOT include related terms as a default, but it would be nice to be able to add a flag, such as a star, to indicate 'related terms', such as searching for 33333*.
In the case of the pure/true synonyms, I could see us using them by default, and possibly (but much less importantly) having a flag to exclude them.
I've done a bit of work on this over the last couple of evenings, the API is shaping up to look like this in use:
this.sut = new FullTextIndexBuilder<int>()
    .WithThesaurus(
        b => b
            .AddSynonyms("happy", "joyous", "delighted")
            .AddSynonyms("large", "big", "massive")
            .AddHypernyms("vehicle", "car", "truck", "motorcycle")
            .AddHypernyms("animal", "mammal", "bird", "reptile"))
    .Build();
You configure thesauruses at a field level like this:
this.sut = new FullTextIndexBuilder<int>()
    .WithObjectTokenization<TestObjectA>(
        options => options
            .WithKey(x => x.Id)
            .WithField(
                "ObjectAText",
                x => x.Text,
                thesaurusOptions: o => o.AddSynonyms("large", "big", "massive")))
    .Build();
And the words provided to the thesaurus are run through the appropriate tokenizer, so if you are using stemming in your index, the synonyms will match correctly.
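As an illustration of why that matters, the sketch below normalizes thesaurus entries and query terms with the same function, so a lookup succeeds even when the surface forms differ. It is a toy model, not LIFTI's implementation: `NormalizedThesaurus` and `ToyNormalizer` are hypothetical, and the "stemmer" merely lowercases and strips a trailing "s".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy stand-in for a real tokenizer/stemmer (assumption: lowercase, strip one trailing "s").
public static class ToyNormalizer
{
    public static string Normalize(string word)
    {
        var lower = word.ToLowerInvariant();
        return lower.Length > 3 && lower.EndsWith("s", StringComparison.Ordinal)
            ? lower.Substring(0, lower.Length - 1)
            : lower;
    }
}

// Hypothetical thesaurus that stores entries in normalized form.
public sealed class NormalizedThesaurus
{
    private readonly Func<string, string> normalize;
    private readonly Dictionary<string, HashSet<string>> groups =
        new Dictionary<string, HashSet<string>>();

    public NormalizedThesaurus(Func<string, string> normalize)
    {
        this.normalize = normalize;
    }

    // Entries are normalized on the way in; a word added twice ends up in its latest group.
    public void AddSynonyms(params string[] words)
    {
        var group = new HashSet<string>(words.Select(normalize));
        foreach (var word in group)
        {
            groups[word] = group;
        }
    }

    // Query terms pass through the same normalizer before the lookup.
    public HashSet<string> Lookup(string queryTerm)
    {
        var key = normalize(queryTerm);
        return groups.TryGetValue(key, out var group)
            ? group
            : new HashSet<string> { key };
    }
}
```

For example, after `AddSynonyms("Cars", "automobile")`, looking up "automobiles" yields the normalized group containing "car" and "automobile", because both sides were reduced to the same form before being compared.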
I've still got quite a few tests to write to make sure the behaviour is as I'm intending, but I think I'll be able to get it in without any additional breaking changes in the v4 release. You can see the minimal tests so far.
Does this feel like it's in line with what you're expecting @markwriter?
My quick take on this is that this is exactly what would fit the requirements - the distinction between synonyms and hypernyms (new word to me!) is extremely useful.
In my case I would use the thesaurus options globally in the index, but adding the synonyms to each indexed property on a class would be easy to do.
This is a great addition to Lifti.
BTW, we are days away from going to production with Lifti as the search engine replacing the home-grown one we had before. Very exciting. The code we had in place up to this point evaluates the text each and every time a search is done. The average search takes around 10 seconds. Users are going to be so happy with the improvement that Lifti provides.
v4 is out now - if you eventually use the thesaurus feature, let me know how you get on and feel free to raise an issue if something's not working for you!
Thanks for letting me know - I'll try to pull it in and use it here in the next day or two. Very cool.
I've really enjoyed using the synonyms features - having it return matches for "doctor" and "physician" or "bike" and "bicycle" or "house" and "home" is fabulous.