Primary language: Python. License: MIT.

entity-classifier

Classify entities into clusters via a zero-shot approach that compares embedding vectors of the entity names against those of a given list of category names.

  • use an embedding to make vectors of the entity names
  • use the same embedding to make vectors of the category names
  • for each entity vector, find the category whose vector is nearest
  • the entities can then be classified and presented in logical groups
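
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code in main.py: `nearest_category`, `toy_embed` and the `TOY_VECTORS` table are hypothetical names with made-up values, used so that the example runs without downloading a model. In practice `embed` would presumably wrap an SBERT model's `encode` method.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def normalise(v: Vector) -> list[float]:
    """Scale a vector to unit length so that dot products equal cosines."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def nearest_category(entity: str, categories: Sequence[str],
                     embed: Callable[[str], Vector]) -> str:
    """Return the category whose vector is closest to the entity's vector."""
    entity_vec = normalise(embed(entity))

    def similarity(category: str) -> float:
        category_vec = normalise(embed(category))
        return sum(a * b for a, b in zip(entity_vec, category_vec))

    return max(categories, key=similarity)


# Toy stand-in for a real embedding model (hypothetical values, illustration only).
TOY_VECTORS = {
    "animal": [1.0, 0.1],
    "country": [0.1, 1.0],
    "Zebu": [0.9, 0.3],
    "Austria": [0.2, 0.8],
}


def toy_embed(text: str) -> Vector:
    return TOY_VECTORS[text]


labels = {name: nearest_category(name, ["animal", "country"], toy_embed)
          for name in ["Zebu", "Austria"]}
print(labels)  # {'Zebu': 'animal', 'Austria': 'country'}
```

Passing the embedding in as a function keeps the classification logic independent of the model, so a different embedding can be swapped in without touching the rest.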

Approach

Compare words (labels) by examining how close their encoded vectors are:

  • the dot product of two normalised vectors equals the cosine of the angle between them
  • cosine distance = 1 - v · w (for normalised vectors v and w)
    • smaller means closer
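
As a small worked example of the formula above, the following sketch computes the cosine distance in plain Python. The function names are illustrative and not taken from this repository.

```python
import math


def normalise(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def cosine_distance(v, w):
    """1 minus the dot product of the two normalised vectors."""
    v_unit, w_unit = normalise(v), normalise(w)
    return 1.0 - sum(a * b for a, b in zip(v_unit, w_unit))


same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])   # close to 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])       # close to 1
opposite = cosine_distance([1.0, 0.0], [-2.0, 0.0])        # close to 2
print(same_direction, orthogonal, opposite)
```

The distance ranges from 0 (same direction) to 2 (opposite direction), and the length of the original vectors has no effect because they are normalised first.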

Dependencies

  • Python 3.11
  • pyenv - if on Windows use pyenv-win

Install

Switch to Python 3.11.6:

pyenv install 3.11.6
pyenv local 3.11.6

Set up a virtual environment:

./create_env.sh

Install SBERT (the sentence-transformers package) and cornsnake with pip:

pip install -U sentence-transformers==2.2.2 cornsnake==0.0.26

Usage

python main.py <path to category list file> <path to entity names file> [threshold (number between 0 and 1)]

Example

To test:

./test.sh

OUTPUT:

CATEGORY: (unknown)
  entity ['Aardvark', 'Alpaca', 'Anaconda']
CATEGORY: animal
  entity ['Albatross', 'Alligator', 'Ant', 'Zebu']
CATEGORY: country
  entity ['Albania', 'Andorra', 'Angola', 'Austria', 'Bangladesh', 'Belgium']

The results are not perfect (for example, Aardvark, Alpaca and Anaconda end up under (unknown) rather than animal), but they are reasonable considering this is a simple 'out of the box' solution.
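
The (unknown) group presumably collects entities whose best match does not pass the threshold. The sketch below shows one way such grouping could work. It assumes the threshold is a maximum allowed cosine distance, which is a guess and not confirmed by this README. The function names and the distances in `matches` are invented for illustration.

```python
from collections import defaultdict

UNKNOWN = "(unknown)"


def group_entities(matches, max_distance):
    """Group entities by their best category.

    matches: iterable of (entity, best_category, cosine_distance).
    Assumption: an entity whose distance exceeds max_distance is
    placed in the (unknown) group instead of its best category.
    """
    groups = defaultdict(list)
    for entity, category, distance in matches:
        key = category if distance <= max_distance else UNKNOWN
        groups[key].append(entity)
    return dict(groups)


def format_groups(groups):
    """Render the groups in the same layout as the sample output."""
    lines = []
    for category in sorted(groups):
        lines.append(f"CATEGORY: {category}")
        lines.append(f"  entity {sorted(groups[category])}")
    return "\n".join(lines)


# Hypothetical distances, made up for illustration.
matches = [
    ("Zebu", "animal", 0.55),
    ("Aardvark", "animal", 0.85),
    ("Albania", "country", 0.40),
]
groups = group_entities(matches, max_distance=0.7)
report = format_groups(groups)
print(report)
```

Under this assumption, a stricter (smaller) threshold moves more entities into (unknown), while a looser one forces every entity into its nearest category.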

Further improvements

Hierarchy of labels:

  • first, classify against a top-level list of labels
  • then, for each label, classify against that label's list of sub-labels
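
A two-level classification along these lines might look like the sketch below. It is not implemented in this repository: `classify_hierarchical`, the `hierarchy` dictionary and the toy vectors are all hypothetical, chosen so the example runs without a model.

```python
import math


def cosine_similarity(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)


def nearest(vector, labels, embed):
    """Return the label whose embedding is most similar to the vector."""
    return max(labels, key=lambda label: cosine_similarity(vector, embed(label)))


def classify_hierarchical(entity, hierarchy, embed):
    """Pick a top-level label first, then one of that label's sub-labels."""
    vector = embed(entity)
    top = nearest(vector, list(hierarchy), embed)
    sub = nearest(vector, hierarchy[top], embed)
    return top, sub


# Toy vectors standing in for a real embedding (illustration only).
TOY_VECTORS = {
    "animal": [1.0, 0.0, 0.0],
    "country": [0.0, 1.0, 0.0],
    "bird": [0.9, 0.0, 0.4],
    "reptile": [0.9, 0.0, -0.4],
    "European country": [0.0, 0.9, 0.4],
    "African country": [0.0, 0.9, -0.4],
    "Albatross": [0.8, 0.1, 0.5],
    "Angola": [0.1, 0.8, -0.5],
}

hierarchy = {
    "animal": ["bird", "reptile"],
    "country": ["European country", "African country"],
}

albatross = classify_hierarchical("Albatross", hierarchy, TOY_VECTORS.get)
angola = classify_hierarchical("Angola", hierarchy, TOY_VECTORS.get)
print(albatross, angola)
```

Each entity is compared only against the sub-labels of its chosen top-level label, which keeps the number of comparisons small but means a wrong top-level choice cannot be corrected at the second level.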

Increase accuracy:

  • take several embeddings per class and use their average for that class
  • try different embedding models, which may give better results
  • try the other distance measures that your embedding library offers
  • consider tuning the embedding (for example, for the domain vocabulary of a particular industry or problem space)
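
The first idea, averaging several embeddings per class, can be sketched as follows. This is an illustration under assumptions, not code from this repository: the example phrases, the function names and the toy vectors are all made up.

```python
import math


def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_similarity(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)


def build_prototypes(examples, embed):
    """Average the embeddings of several phrases into one vector per class."""
    return {category: mean_vector([embed(phrase) for phrase in phrases])
            for category, phrases in examples.items()}


def classify(entity, prototypes, embed):
    """Return the class whose averaged vector is most similar to the entity."""
    vector = embed(entity)
    return max(prototypes, key=lambda c: cosine_similarity(vector, prototypes[c]))


# Toy vectors standing in for a real embedding (illustration only).
TOY_VECTORS = {
    "animal": [1.0, 0.0],
    "mammal": [0.9, 0.2],
    "reptile": [0.8, -0.3],
    "country": [0.0, 1.0],
    "nation": [0.2, 0.9],
    "Anaconda": [0.7, -0.4],
}

examples = {
    "animal": ["animal", "mammal", "reptile"],
    "country": ["country", "nation"],
}
prototypes = build_prototypes(examples, TOY_VECTORS.get)
result = classify("Anaconda", prototypes, TOY_VECTORS.get)
print(result)  # animal
```

Describing a class with several phrases spreads its averaged vector over more of the vocabulary that entities in that class tend to sit near, which may help with entities that the bare category name misses.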

References

My Medium article

Conference notes from ML Con Berlin 2023

SBERT: How to Use Sentence Embeddings to Solve Real-World Problems