clulab/processors

[odin] EmbeddingsResource should extend ExplicitWordEmbeddingMap


Odin's EmbeddingsResource extends the deprecated SanitizedWordEmbeddingMap.

Tangential, but have we given any thought to using an ANN index (e.g., annoy4s) for Odin instead?

In order to avoid the deprecation, the code below can be used. However, since an InputStream is being passed in, nothing keeps track of whether this set of vectors has already been loaded for other purposes. To coordinate that, the OdinResourceManager would need to interface with the WordEmbeddingMapPool, and someone would need to know the naming conventions used in both classes to do so.

package org.clulab.odin.impl

import org.clulab.embeddings.{ExplicitWordEmbeddingMap, WordEmbeddingMap}
import org.clulab.scala.WrappedArray._

import java.io.InputStream

trait OdinResource

// for distributional similarity comparisons
class EmbeddingsResource(is: InputStream) extends OdinResource {
  val wordEmbeddingMap = ExplicitWordEmbeddingMap(is, binary = false)

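  // Dot product of the two word vectors, or -1 when either word is out of
  // vocabulary. Assuming the map stores normalized vectors, the dot product
  // below is the cosine similarity that simScore exposes.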
  def similarity(w1: String, w2: String): Double = {
    val scoreOpt = for {
      vec1 <- wordEmbeddingMap.get(w1)
      vec2 <- wordEmbeddingMap.get(w2)
    } yield WordEmbeddingMap.dotProduct(vec1, vec2).toDouble

    scoreOpt.getOrElse(-1d)
  }
}
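For reference, a sketch of what that coordination might look like, assuming the pool's getOrElseCreate(name, compact, ...) entry point (parameter names here are from memory, so check the embeddings package); the name passed in would still need to match the pool's conventions:

package org.clulab.odin.impl

import org.clulab.embeddings.{WordEmbeddingMap, WordEmbeddingMapPool}
import org.clulab.scala.WrappedArray._

// Sketch only: load through the pool so that vectors already loaded
// elsewhere (e.g., by a processor) are shared rather than re-read.
class PooledEmbeddingsResource(name: String) extends OdinResource {
  val wordEmbeddingMap: WordEmbeddingMap =
    WordEmbeddingMapPool.getOrElseCreate(name, compact = true)

  def similarity(w1: String, w2: String): Double = {
    val scoreOpt = for {
      vec1 <- wordEmbeddingMap.get(w1)
      vec2 <- wordEmbeddingMap.get(w2)
    } yield WordEmbeddingMap.dotProduct(vec1, vec2).toDouble

    scoreOpt.getOrElse(-1d)
  }
}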

Thanks for the snippet, @kwalcock.

Have you all talked about using an ANN index for a large set of embeddings? Since processors still uses static word embeddings, I am thinking n-gram embeddings could help improve the relevance of multi-token matches.

As in approximate nearest neighbor? Some were used for ConceptAlignment in alignment/indexer/knn/hnswlib. Specifically, this library was used: hnswlib. Only individual strings were added to the index, so I suppose that's unigram. Are you wanting to pair the words and concatenate their vectors? I haven't heard of that mentioned in relation to processors.

Are you wanting to pair the words and concatenate their vectors?

No, I meant averaging: summing element-wise and then dividing by the number of vectors. It seems my memory is mistaken, though; we don't currently support this kind of thing: simScore(ave(embedding(<tok-1>), embedding(<tok-2>))) > 0.6
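For concreteness, a minimal self-contained sketch of that element-wise averaging (ave and dot below are hypothetical helpers, not anything that exists in processors, and the 2-d vectors are toy stand-ins for real embeddings):

// Element-wise average: sum corresponding components, divide by the count.
def ave(vectors: Seq[Array[Float]]): Array[Float] = {
  val sums = vectors.reduce { (a, b) => a.zip(b).map { case (x, y) => x + y } }
  sums.map(_ / vectors.length)
}

// Plain dot product; with normalized embeddings this is cosine similarity,
// which is what simScore already computes for single tokens.
def dot(a: Array[Float], b: Array[Float]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// So a check like simScore(ave(embedding(<tok-1>), embedding(<tok-2>))) > 0.6
// would boil down to something like:
val tok1 = Array(0.6f, 0.8f)
val tok2 = Array(0.8f, 0.6f)
val query = Array(1.0f, 0.0f)
val passes = dot(ave(Seq(tok1, tok2)), query) > 0.6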

As in approximate nearest neighbor?

Yes, an approximate nearest neighbors index.

Right now Odin token constraints support expressions like simScore("tiger") > 0.9, which retrieve the embedding of the current token being examined and calculate its cosine similarity with the embedding of "tiger". Imagine if you wanted to use this pattern with a phrase like "tax attorney". Including embeddings for all such bigrams in some kind of in-memory store isn't very practical. An ANN index is one possible solution.
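As a rough sketch of that direction, using the same hnswlib bindings as ConceptAlignment (the Item, HnswIndex, floatCosineDistance, and findNearest names below are from memory of com.github.jelmerk's Scala API and should be checked against the actual library version; the helper and toy vectors are made up): precompute the averaged vector for each n-gram, index it once, and answer simScore-style queries approximately.

import com.github.jelmerk.knn.scalalike.{Item, floatCosineDistance}
import com.github.jelmerk.knn.scalalike.hnsw.HnswIndex

// An indexed n-gram: its surface form plus its averaged word vector.
case class Ngram(id: String, vector: Array[Float]) extends Item[String, Array[Float]] {
  override def dimensions: Int = vector.length
}

// Element-wise average of an n-gram's word vectors.
def ave(vectors: Seq[Array[Float]]): Array[Float] =
  vectors.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / vectors.length)

// Toy 2-d vectors standing in for real (normalized) embeddings.
val tax      = Array(0.9f, 0.1f)
val attorney = Array(0.3f, 0.9f)
val lawyer   = Array(0.4f, 0.8f)

val ngrams = Seq(
  Ngram("tax attorney", ave(Seq(tax, attorney))),
  Ngram("lawyer", ave(Seq(lawyer))))

// Build the ANN index once; lookups are then approximate but sublinear, so
// bigram and trigram vectors never need to sit in a flat in-memory map.
val index = HnswIndex[String, Array[Float], Ngram, Float](
  dimensions = 2, distanceFunction = floatCosineDistance, maxItemCount = ngrams.size)
index.addAll(ngrams)

// Nearest indexed n-grams to the averaged query vector for "tax attorney".
index.findNearest(ave(Seq(tax, attorney)), 2).foreach(println)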

Larger context: I am thinking about extending Odin to support a new kind of Embedding-based NER (just a sketch below):

- name: "embedding-ner"
  label: ActionStar
  type: embedding
  # will compare available embeddings for n-grams of the specified sizes
  phrases: [1, 2, 3]
  pattern: |
    ave("Sylvester Stallone", "Arnold Schwarzenegger") > .9