cosine-similarity-search

Search short texts by the cosine similarity of their character n-grams.

This is a very simple example of finding cocktail names, even when the query is misspelled. The base corpus is a list of cocktail names, found in

/data/cocktailnames.txt

Prerequisite

Python 3.6

If you would like to run this in a virtual environment:

- install poetry (https://pypi.org/project/poetry/)
- change to the cosine-similarity-search directory (where the .toml file is located)

$ poetry update
$ poetry shell

Otherwise, install joblib, pandas and scikit-learn via pip.

Usage of scripts in /src

0_make_name_bi_grams.py
- load the corpus and transform each cocktail name into a string of bigrams, e.g. "Zombie" -> "zo om mb bi ie"
1_save_train_vectors.py
- load bigrams (saved in 0)
- fit a vectorizer on the bigrams and transform them into a matrix

Scripts 0 and 1 could be merged into one, but creating character n-grams for a bigger corpus can be time-consuming, so it might be better to keep them separate.
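
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the scripts: the helper to_bigrams, the choice of TfidfVectorizer, the per-word handling of multi-word names and the output file names are all assumptions.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer


def to_bigrams(name: str) -> str:
    """Return the character bigrams of each word, joined by spaces."""
    grams = []
    for word in name.lower().split():
        grams.extend(word[i:i + 2] for i in range(len(word) - 1))
    return " ".join(grams)


# Tiny stand-in corpus; the real one lives in /data/cocktailnames.txt.
names = ["Zombie", "Mai Tai", "Margarita"]
bigram_strings = [to_bigrams(name) for name in names]

# Assumption: TF-IDF weighting. The default tokenizer keeps tokens of two
# or more word characters, so each alphanumeric bigram becomes one feature.
vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(bigram_strings)

# Persist both objects so a later script can reload them.
# The file names are hypothetical.
out_dir = Path(tempfile.mkdtemp())
joblib.dump(vectorizer, str(out_dir / "vectorizer.joblib"))
joblib.dump(train_matrix, str(out_dir / "train_matrix.joblib"))

print(bigram_strings[0])
print(train_matrix.shape)
```

Saving the fitted vectorizer together with the matrix matters because new queries must be mapped onto exactly the same bigram vocabulary later on.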

2_load_predict_test.py
- load vectorizer and matrix
- load cocktailnames
- create testset corpus and transform to bigrams
- use loaded train_vectorizer on new corpus
- calculate similarity
- iterate over best results
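
A hedged sketch of the lookup step, assuming a TfidfVectorizer and scikit-learn's cosine_similarity. The small corpus, the to_bigrams helper and the search function are hypothetical stand-ins, and the vectorizer is fitted inline here instead of being loaded from disk, so the scores will differ from the ones listed in this README.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def to_bigrams(name: str) -> str:
    """Return the character bigrams of each word, joined by spaces."""
    grams = []
    for word in name.lower().split():
        grams.extend(word[i:i + 2] for i in range(len(word) - 1))
    return " ".join(grams)


# Stand-in for the cocktail names in /data/cocktailnames.txt.
corpus = ["Margarita", "Mai Tai", "Caipirinha", "Vodka Martini", "Kamikaze"]

# In the scripts, vectorizer and matrix are loaded; here they are fitted inline.
vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform([to_bigrams(name) for name in corpus])


def search(query: str, top_n: int = 5):
    """Return the top_n corpus names ranked by cosine similarity to the query."""
    # transform (not fit_transform) reuses the vocabulary learned from the corpus
    query_vector = vectorizer.transform([to_bigrams(query)])
    scores = cosine_similarity(query_vector, train_matrix)[0]
    best = np.argsort(scores)[::-1][:top_n]
    return [(corpus[i], round(float(scores[i]), 2)) for i in best]


for request in ["magerita", "mai tai"]:
    print(request)
    for name, score in search(request):
        print(f"    {name} {score}")
```

Bigrams that never occurred in the corpus are simply ignored by transform, which is why a misspelled query can still land near the intended name.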

The 5 best results for 4 requests (3 deliberately misspelled):

magerita
    Margarita 0.47
    Mai Tai 0.42
    Strawberry Margarita 0.42
    Whitecap Margarita 0.42
    Blue Margarita 0.4

tonik vodka
    Vodka And Tonic 0.71
    Long vodka 0.62
    Vodka Martini 0.53
    Vodka Russian 0.42
    Kamikaze 0.35

caiprinja
    Caipirinha 0.62
    Dark Caipirinha 0.52
    Caipirissima 0.45
    Irish Spring 0.43
    Casino 0.33

mai tai
    Mai Tai 0.99
    Hawaiian Cocktail 0.45
    Shanghai Cocktail 0.43
    Martinez Cocktail 0.39
    Masala Chai 0.37

Here, the vectorization and the similarity calculation are done the "scikit way".

You can also take a look at /src/simple_example.py to see how to create a vector and calculate the cosine similarity "by hand".
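
For orientation, a "by hand" version could look roughly like the sketch below. It is not the content of /src/simple_example.py: the function names are made up for illustration, and it assumes raw bigram counts as vector entries, with bigrams built per word.

```python
import math
from collections import Counter


def bigram_vector(text: str) -> Counter:
    """Count the character bigrams of each word; the Counter acts as a sparse vector."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Dot product of both vectors divided by the product of their lengths."""
    dot = sum(a[gram] * b[gram] for gram in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a text without any bigram matches nothing
    return dot / (norm_a * norm_b)


query = bigram_vector("magerita")
for name in ["Margarita", "Mai Tai", "Zombie"]:
    print(name, round(cosine(query, bigram_vector(name)), 2))
```

Only bigrams shared by both texts contribute to the dot product, so two names without a common bigram always score 0.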
