Implement "concept" as a first-class citizen alongside Term.
kreeben opened this issue · 13 comments
The purpose of a concept is to give meaning to a word or cluster of words so that an aggregated concept can be built that describes either a paragraph or the document in its entirety.
In a corpus there are always fewer concepts than there are terms. Therefore, if you could compare concepts in vector space instead of terms, you would gain in querying speed.
In order to give new meaning to a word or cluster of words, more information has to be added to the equation than just the words.
It's a good thing, then, that concepts may be extracted from the context in which a word or sentence lives, the context being the words or sentences that surround it.
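A minimal sketch of "context as the surrounding words", in Python for brevity. The function name `context_vectors` and the window size are assumptions made for illustration, not anything from Resin; the idea is simply to count which words appear near each word.

```python
from collections import Counter, defaultdict

def context_vectors(tokens, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                vectors[word][tokens[j]] += 1
    return vectors

tokens = "the king ruled the land and the emperor ruled the empire".split()
vectors = context_vectors(tokens)

# Words that share context are candidates for the same concept.
shared = set(vectors["king"]) & set(vectors["emperor"])
print(sorted(shared))
```

Words whose context vectors overlap heavily would then be candidates for grouping under one concept; how to threshold or cluster them is left open here.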
Sounds like fun, right?
Can this also be used for synonyms?
This is instead of synonyms. "King" and "emperor" should both be part of the same (oppressive and undemocratic ruler) concept.
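To make the "king"/"emperor" point concrete, here is a rough sketch that compares two documents first by term and then by concept. The `TERM_TO_CONCEPT` table is a hand-written, hypothetical stand-in for whatever would actually produce concepts; only the comparison itself (cosine similarity over count vectors) is standard.

```python
import math
from collections import Counter

# Hypothetical term -> concept mapping; in practice this would be learned or curated.
TERM_TO_CONCEPT = {
    "king": "ruler",
    "emperor": "ruler",
    "land": "territory",
    "empire": "territory",
}

def concept_vector(tokens):
    """Count concepts rather than terms; terms without a concept are dropped."""
    return Counter(TERM_TO_CONCEPT[t] for t in tokens if t in TERM_TO_CONCEPT)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc_a = ["king", "ruled", "land"]
doc_b = ["emperor", "ruled", "empire"]

term_sim = cosine(Counter(doc_a), Counter(doc_b))
concept_sim = cosine(concept_vector(doc_a), concept_vector(doc_b))
print(term_sim, concept_sim)
```

The concept vectors also have fewer dimensions than the term vectors, which is where the hoped-for querying speedup would come from.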
Concepts also span multiple terms, and then you have to deal with disambiguation ;-)
I'll give it a very, very basic go and submit a PR.
Also on that subject, that's a reason to separate the analyzer and tokenizer into separate interfaces, especially with regard to concept identification, where the concept can span/encapsulate multiple terms. Using an analyzer with a tightly coupled tokenization function makes this a nightmare.
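One way the split could look, sketched in Python. All of the names here (`Token`, `Tokenizer`, `Analyzer`, `WhitespaceTokenizer`, `PhraseConceptAnalyzer`) are hypothetical and are not Resin's actual interfaces; the point is only that the analyzer consumes tokens and may merge several of them into one concept.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Tuple

@dataclass(frozen=True)
class Token:
    text: str
    position: int    # index of the first word this token covers
    length: int = 1  # number of words this token covers

class Tokenizer(Protocol):
    def tokenize(self, text: str) -> List[Token]: ...

class Analyzer(Protocol):
    def analyze(self, tokens: List[Token]) -> List[Token]: ...

class WhitespaceTokenizer:
    """Splits on whitespace and lowercases; knows nothing about concepts."""
    def tokenize(self, text: str) -> List[Token]:
        return [Token(word.lower(), i) for i, word in enumerate(text.split())]

class PhraseConceptAnalyzer:
    """Merges known two-word phrases into a single concept token."""
    def __init__(self, phrases: Dict[Tuple[str, str], str]):
        self.phrases = phrases

    def analyze(self, tokens: List[Token]) -> List[Token]:
        out, i = [], 0
        while i < len(tokens):
            pair = tuple(t.text for t in tokens[i:i + 2])
            if pair in self.phrases:
                # One concept token that spans two source terms.
                out.append(Token(self.phrases[pair], tokens[i].position, 2))
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

tokens = WhitespaceTokenizer().tokenize("The holy roman emperor ruled")
result = PhraseConceptAnalyzer({("roman", "emperor"): "ruler"}).analyze(tokens)
print([t.text for t in result])
```

Because the analyzer only sees a token list, either half can be swapped without touching the other.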
@jhashemi you're giving this a go?
I could tell you about my ideas but I'm not going to. You seem to have an itch. Just a basic proof of concept will do :)
Before I go deeper into it, I wanted to chat about whether this should be knowledge-base, ontology-base or API based.
Maybe I'll UML it out and push an architecture project.
Sounds good.
Some thoughts on this issue. When I added it I was thinking about (1) how to implement word2vec, simplified in the same way the vector space model is simplified in Resin and in Lucene, but also (2) how to produce word vectors at indexing time. If you add one document at a time to your index, how would you then be able to produce word vectors? That doesn't seem possible. So perhaps "concepts" or word vectors or sentiment analysis or whatever you want to call it is an operation you do on an existing index. The sentiment analysis operation could produce a new concept-based index that complements the term-based one.
The concept-based index would contain pointers into the term-based index which in turn has pointers into the postings and document store.
Having a concept-based index would mean you could make more directed lookups into the term-based index instead of large scans.
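A toy sketch of that layering, in Python: a term index is built first, and a concept index is derived from it afterwards, pointing at terms rather than documents. Plain dicts stand in for the real index structures, and the function names and the term-to-concept table are assumptions for illustration only.

```python
from collections import defaultdict

def build_term_index(documents):
    """term -> sorted list of document ids (a stand-in for the postings)."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def build_concept_index(term_index, term_to_concept):
    """concept -> terms present in the term index; built from an existing index."""
    index = defaultdict(list)
    for term in sorted(term_index):
        concept = term_to_concept.get(term)
        if concept is not None:
            index[concept].append(term)
    return dict(index)

def lookup(concept, concept_index, term_index):
    """Follow concept -> terms -> postings, touching only the relevant terms."""
    doc_ids = set()
    for term in concept_index.get(concept, []):
        doc_ids.update(term_index[term])
    return sorted(doc_ids)

documents = [
    "The king ruled the land",
    "An emperor ruled the empire",
    "The land was green",
]
term_index = build_term_index(documents)
# Hypothetical mapping; producing it is the hard part discussed in this thread.
concept_index = build_concept_index(term_index, {"king": "ruler", "emperor": "ruler"})
hits = lookup("ruler", concept_index, term_index)
print(hits)
```

The lookup never scans the term index; it goes straight to the terms the concept points at.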
All in theory and a bit diffuse in my mind at the moment.
Also, we would need a new tree to represent words instead of just characters. Does it have to be a B+ tree? I mean sure, all devs should roll a B+ tree once in their lives, I guess. Maybe it'll be fun?
This looks pretty good: https://github.com/asengupta/BPlusTree/blob/master/BPlusTree/BTreeNode.cs
Check out https://github.com/jhashemi/resin/tree/master/src/Resin.Analyses.Concept
Ideally, concepts are represented as graphs. A separate concept index will most definitely be needed, as will an implementation of IVocabulary that depends on a Resin index. This is very, very rudimentary and untested; I just wanted to get my ideas on paper.
Also, for graphs, a sparse adjacency matrix implementation typically works best, with each axis being your node IDs and relationships established as 0 or 1. You can use a bitmap index to make traversal extremely fast.
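A rough illustration of the adjacency-matrix-as-bitmaps idea, in Python. Each row of the 0/1 matrix is stored as one integer used as a bitset, so following edges becomes bitwise OR and AND. The `BitmapGraph` class is hypothetical, and Python integers are only a stand-in here; a real sparse implementation would more likely use a compressed bitmap format.

```python
class BitmapGraph:
    """Directed graph whose adjacency matrix rows are stored as integer bitsets."""

    def __init__(self, node_count):
        self.rows = [0] * node_count  # rows[src] has bit dst set if src -> dst

    def add_edge(self, src, dst):
        self.rows[src] |= 1 << dst

    def has_edge(self, src, dst):
        return bool((self.rows[src] >> dst) & 1)

    def neighbours(self, src):
        """Decode one row back into a list of node ids."""
        row, node, out = self.rows[src], 0, []
        while row:
            if row & 1:
                out.append(node)
            row >>= 1
            node += 1
        return out

    def reachable(self, start):
        """Breadth-first traversal; returns a bitset of every node reachable from start."""
        visited = 1 << start
        frontier = visited
        while frontier:
            next_frontier = 0
            bits, node = frontier, 0
            while bits:
                if bits & 1:
                    next_frontier |= self.rows[node]  # union of outgoing edges
                bits >>= 1
                node += 1
            frontier = next_frontier & ~visited  # keep only unseen nodes
            visited |= frontier
        return visited

graph = BitmapGraph(4)
graph.add_edge(0, 1)
graph.add_edge(1, 2)
graph.add_edge(3, 0)
reached = graph.reachable(0)
print(bin(reached))
```

Each traversal step handles a whole frontier with a few bitwise operations instead of following edges one by one.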
I will check this out shortly. I ran through the code and it looked very promising.
This issue is still open but needs a new strategy because of the new type of index introduced here: 5f85425
Will be solved at a later time.