weaviate/contextionary

Use word occurrence when turning Concept Corpus to Vector


Todos

  • Extract corpus vectorization into independent unit
  • Add unit tests for existing behavior
  • Extend with linear weighting strategy
    • read factor from config
  • Extend *core.Vector with occurrence, or retrieve the occurrence otherwise

Right now, each word is weighted equally; instead, a rare word should have more influence than a common word.

What I would like to propose is a single function that calculates a weighted centroid based on an array input, like: calculateWeightedCentroid(['my', 'name', 'is', 'bob']). (I'm agnostic regarding the naming and inner workings of the function; it is pseudo-code that helps me make my point.)

What I would propose the function to do is this:

  1. Remove stopwords (['my', 'name', 'bob'])
  2. Calculate a centroid based on vector positions, where the occurrence of the word in the training corpus determines its weight.

This function can be used not only to vectorize objects for storage, but also to calculate the vector position Weaviate should search for when the Explore{} function is used.
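
Roughly, such a function could look like the sketch below (all names, parameters, and helpers are hypothetical; this is only meant to illustrate the two steps, not the actual implementation):

// wordInfo is a hypothetical container holding a word's vector and how often
// it occurred in the training corpus.
type wordInfo struct {
	vector     []float32
	occurrence uint64
}

// calculateWeightedCentroid is a sketch, not the actual implementation: it
// drops stopwords (step 1) and then averages the remaining word vectors,
// weighted by an occurrence-based weight where rarer words weigh more (step 2).
func calculateWeightedCentroid(words []string, lookup map[string]wordInfo,
	stopwords map[string]bool, weightFor func(occurrence uint64) float32) []float32 {

	var sum []float32
	var totalWeight float32

	for _, word := range words {
		if stopwords[word] {
			continue // step 1: ignore stopwords entirely
		}
		info, ok := lookup[word]
		if !ok {
			continue // word not in the contextionary, skip for simplicity
		}
		w := weightFor(info.occurrence)
		if sum == nil {
			sum = make([]float32, len(info.vector))
		}
		for i, v := range info.vector {
			sum[i] += v * w // step 2: occurrence-weighted sum
		}
		totalWeight += w
	}

	if totalWeight == 0 {
		return nil // nothing left after stopword removal
	}
	for i := range sum {
		sum[i] /= totalWeight
	}
	return sum
}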

Determining weight based on occurrence.

Let's say that word-a occurs 12 times in the corpus and has vector position x=1, y=2, and word-b occurs 4 times and has vector position x=4, y=8. Normally, the centroid of a and b would be x=2.5, y=5. The problem with this is that it gravitates towards the center of the C11y and makes the result more ambiguous. Therefore, I would like to propose weighing the results so that the more often a word occurs, the less weight it gets (i.e., it is seen as less important).
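
For illustration, with hypothetical weights of 0.5 for the common word-a and 1.0 for the rare word-b, the weighted centroid would be x = (0.5*1 + 1.0*4) / 1.5 = 3 and y = (0.5*2 + 1.0*8) / 1.5 = 6, which lies noticeably closer to word-b than the unweighted x=2.5, y=5.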

I would like to use this issue to discuss potential calculations.

PS:
Regarding stopwords: maybe this functionality will result in adding the stopwords back in, because they do serve a function, but since they occur so often their weight will be low anyway.

A. The function

The function already exists; it even works with nested arrays, e.g.:

[
   [`my`, `name`, `is`, `Bob`],
   [`I`, `am`, `a`, `rocket`, `scientist`],
]

would first build the centroid of the first inner array, then the second, and then build a centroid from both. The only thing it doesn't do yet is take the occurrence into account.
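
For illustration only (this is a sketch of the behaviour described above, not the actual code), the nested handling could look roughly like this: build one centroid per inner array, then average those centroids.

// centroidOfCorpus sketches the nested behaviour described above: every inner
// word list is reduced to its own centroid first, and those centroids are then
// averaged into the final vector. Occurrence is not yet taken into account here.
func centroidOfCorpus(corpus [][]string, centroidOf func(words []string) []float32) []float32 {
	var sum []float32
	var count int

	for _, words := range corpus {
		c := centroidOf(words) // e.g. the centroid of one inner array
		if c == nil {
			continue
		}
		if sum == nil {
			sum = make([]float32, len(c))
		}
		for i, v := range c {
			sum[i] += v
		}
		count++
	}

	if count == 0 {
		return nil
	}
	for i := range sum {
		sum[i] /= float32(count)
	}
	return sum
}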

This function is currently used both for vectorizing a concept and for vectorizing search terms.

B. The calculation formulas

I'd suggest starting with the simplest possible formula and then trying to improve it. A simple proposal would be:

O = Occurrence (how often the word occurs in the corpus)
Omax = The most any word occurs in this corpus
Omin = The least any word occurs in this corpus
x = A constant we can tweak to weigh more aggressively or more lightly (between 0 and 1)

w = 1 - ( (O - Omin) / (Omax - Omin) * x )

With this, the example you mentioned would end up with the following weights for various x values:

x = 1 (extremely aggressive weighing)
w(word-a) = 1 - (12 - 4) / (12 - 4) * 1 = 0
w(word-b) = 1 - (4 - 4) / (12 - 4) * 1 = 1

=> since x was extremely aggressive we ignored the common word entirely and only used the rare word

x = 0 (extremely weak weighing)
w(word-a) = 1 - (12 - 4) / (12 - 4) * 0 = 1
w(word-b) = 1 - (4 - 4) / (12 - 4) * 0 = 1

=> since x was extremely weak, the result is the same as if we did no weighing whatsoever

x = 0.5 (average strength weighing)
w(word-a) = 1 - (12 - 4) / (12 - 4) * 0.5 = 0.5
w(word-b) = 1 - (4 - 4) / (12 - 4) * 0.5 = 1

=> the weighting now has a moderate effect: word-b keeps its full weight (it counts 1.0 times), whereas word-a only counts 0.5 times, so the rare word has twice the influence of the common one
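
To make the formula concrete, here is a small self-contained Go sketch (names are made up) that reproduces the three cases above:

package main

import "fmt"

// linearWeight implements w = 1 - ((O - Omin) / (Omax - Omin) * x), where
// x in [0, 1] controls how aggressively common words are down-weighted.
func linearWeight(o, oMin, oMax uint64, x float32) float32 {
	if oMax == oMin {
		return 1 // all words occur equally often: weigh them equally
	}
	return 1 - float32(o-oMin)/float32(oMax-oMin)*x
}

func main() {
	// word-a occurs 12 times, word-b occurs 4 times, so Omin=4 and Omax=12
	for _, x := range []float32{1, 0, 0.5} {
		fmt.Printf("x=%.1f: w(word-a)=%.2f, w(word-b)=%.2f\n",
			x, linearWeight(12, 4, 12, x), linearWeight(4, 4, 12, x))
	}
	// => x=1.0: 0.00/1.00, x=0.0: 1.00/1.00, x=0.5: 0.50/1.00
}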

C. Stopwords

Your PS is essentially what I tested out in the beginning. Since I wasn't very happy with the results, I tested removing stopwords altogether and the results improved drastically. Paul confirmed that removing stopwords before training the c11y should in theory yield the best results. That's also what I found in practice. So I'm hesitant to reintroduce them, but I'm open to testing it out. If the results actually improve, I'm easily convinced.

Regardless of the above, our stop word list is currently quite long. We could experiment with a shorter list instead.

Per bullet:
A. Super.
B. Would it be an idea to let people select a function, so that we can support multiple ones in the future? I.e., one would not only pass the nested array but also indicate which calculation should run. We can start with one for now, but it might be something to keep in mind architecturally?
C. Aha, this does ring a bell. Are we removing stopwords before training?

B. Sure. Would the user enter the actual formula (like "a + b - c * d"), or would they select from formulas which we prefill (like "linear", "log n", "impossibly difficult rocket science"), etc.?

C. Yes, since 0.6.0 I believe. The argument is that the remaining words will have more explicit positions because they are not watered down by stopwords, which would pull every word closer to (0, 0, ..., 0). From my small sample size I can confirm that what we saw matches the theoretical expectations :)

B. Oh no, definitely select from a set of options ("linear", "log n", "impossibly difficult rocket science").
C. Awesome
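
For what it's worth, the selection could be as simple as a config value naming the strategy. A sketch with hypothetical names (not the actual config keys), reusing the linearWeight sketch from above:

// WeightingConfig is a hypothetical config shape: the user picks a predefined
// strategy and its factor instead of entering a formula themselves.
type WeightingConfig struct {
	Strategy string  // e.g. "none", "linear", later maybe "log"
	Factor   float32 // the tweakable constant x used by the linear strategy
}

// weightFuncFor maps the selected strategy to an implementation.
func weightFuncFor(cfg WeightingConfig) func(o, oMin, oMax uint64) float32 {
	switch cfg.Strategy {
	case "linear":
		return func(o, oMin, oMax uint64) float32 {
			return linearWeight(o, oMin, oMax, cfg.Factor)
		}
	default: // "none" or unknown: weigh every word equally
		return func(o, oMin, oMax uint64) float32 { return 1 }
	}
}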

Closing, as the first mechanism (linear weighting with a configurable factor) is implemented and released. Feel free to open new issues for additional weighting mechanisms or changes.