emoji_finder

Primary language: Python. License: MIT.

Semantic Emoji Search

-----> Try the Live Web App <-----

If you're a Mac user who uses (or wants to try) LaunchBar, try my Semantic Emoji Finder Launch Bar Action.

Inspired (nerd sniped?) by this post on Mastodon, I built this project to do semantic search for emoji. So you can search for flower and also get bouquet 💐 and cherry blossom 🌸. (The iOS emoji keyboard does something similar, but this remains unavailable on macOS.)

I'm using the Python sentence_transformers package available from SBERT. It provides a variety of pretrained models suited to finding a semantic match between a search term and a target. I'm using the all-mpnet-base-v2 model for the web apps.

The web app now works in two ways. For one-word queries, everything is precomputed and stored in SQLite: semantic distances are computed ahead of time against a corpus of common English words (now 40,000 words), and the top results are stored in all-mpnet-base-v2_main.db along with lookup tables, indices, and views that make looking up a word a simple SQL query.
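As a rough illustration of that precomputed lookup, here is a minimal sketch using Python's built-in sqlite3 module. The table, view, and column names (words, emojis, results, word_matches, match_rank) are assumptions made up for this example; the actual schema of all-mpnet-base-v2_main.db is not described here and may differ.

```python
import sqlite3

# Hypothetical schema for illustration only; not the real database layout.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE words  (word_id INTEGER PRIMARY KEY, word TEXT UNIQUE);
CREATE TABLE emojis (emoji_id INTEGER PRIMARY KEY, emoji TEXT, label TEXT);
CREATE TABLE results (
    word_id    INTEGER REFERENCES words(word_id),
    emoji_id   INTEGER REFERENCES emojis(emoji_id),
    match_rank INTEGER  -- 1 = closest precomputed match
);
CREATE INDEX idx_results_word ON results(word_id, match_rank);
CREATE VIEW word_matches AS
    SELECT w.word, r.match_rank, e.emoji, e.label
    FROM results r
    JOIN words  w ON w.word_id  = r.word_id
    JOIN emojis e ON e.emoji_id = r.emoji_id;
""")

# Toy data standing in for the precomputed top results.
con.execute("INSERT INTO words VALUES (1, 'flower')")
con.executemany(
    "INSERT INTO emojis VALUES (?, ?, ?)",
    [(1, "🌸", "cherry blossom"), (2, "💐", "bouquet")],
)
con.executemany("INSERT INTO results VALUES (?, ?, ?)", [(1, 2, 1), (1, 1, 2)])


def lookup(con, word, limit=25):
    """Return precomputed (emoji, label) matches for a single word, best first."""
    cursor = con.execute(
        "SELECT emoji, label FROM word_matches "
        "WHERE word = ? ORDER BY match_rank LIMIT ?",
        (word.lower(), limit),
    )
    return cursor.fetchall()


matches = lookup(con, "flower")
```

Because the ranking is done ahead of time, the query at search time is only an indexed read, so no model needs to be loaded to answer a one-word search.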

For longer queries, I used precompute.py to generate all-MiniLM-L6-v2 vectors, which I store in the DuckDB database vectors.db now that DuckDB supports fixed-size arrays. The LiveSearch class then uses llama.cpp to create a vector for the search term (thanks to the new BERT support in llama.cpp), and DuckDB finds the most similar emojis using cosine similarity with a short query:

  select id, array_cosine_similarity(arr, ?::DOUBLE[384]) as similarity, e.* from array_table a left join emoji_df e on a.id = e.idx where label = base_emoji order by similarity desc limit 25;
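To make the query's logic concrete, here is a pure-Python sketch of the same steps: score each stored vector against the query vector, keep rows where label equals base_emoji, and return the top matches. It uses toy 3-dimensional vectors instead of 384-dimensional ones and flattens the two joined tables into one list of dicts. Reading the label = base_emoji filter as "skip variants such as skin tones" is an assumption, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query_vec, rows, limit=25):
    """Mirror the SQL: filter, score, order by similarity desc, limit."""
    scored = [
        (cosine_similarity(row["arr"], query_vec), row)
        for row in rows
        if row["label"] == row["base_emoji"]  # assumed to drop variant emoji
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(row["id"], similarity) for similarity, row in scored[:limit]]


# Toy rows standing in for array_table joined with emoji_df.
rows = [
    {"id": 0, "label": "bouquet", "base_emoji": "bouquet",
     "arr": [0.9, 0.1, 0.0]},
    {"id": 1, "label": "rocket", "base_emoji": "rocket",
     "arr": [0.0, 0.2, 0.9]},
    {"id": 2, "label": "waving hand: dark skin tone", "base_emoji": "waving hand",
     "arr": [0.8, 0.2, 0.1]},
]

# In the real app the query vector would come from the embedding model.
ranked = top_matches([1.0, 0.0, 0.0], rows)
```

In the app itself DuckDB does this work inside the database, which avoids pulling every vector into Python for each search.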

The Dash app also includes a 2D projection of the sentence_transformers vectors via UMAP, which shows how the emojis relate to each other semantically. The graph shows at most 750 emoji at once, but more appear as you zoom in on the Plotly graph. Clicking an emoji displays it with a button to copy it to the clipboard.
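One way the 750-emoji cap could behave is sketched below: keep only the points inside the current viewport, then truncate to the cap, so zooming in reveals points that were cut off at the wider view. This is an assumption about the approach, not the app's actual code; visible_points and its parameters are hypothetical names, and the input list is assumed to be ordered by display priority.

```python
def visible_points(points, x_range, y_range, max_points=750):
    """Return at most max_points of the (x, y, emoji) points inside the viewport."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    inside = [
        p for p in points
        if x_min <= p[0] <= x_max and y_min <= p[1] <= y_max
    ]
    # Earlier entries win when the viewport holds more points than the cap.
    return inside[:max_points]


# Toy 2D layout: 2000 points on a 100 x 20 grid, standing in for UMAP output.
points = [(i % 100, i // 100, f"emoji-{i}") for i in range(2000)]

full_view = visible_points(points, (0, 99), (0, 19))   # capped at 750
zoomed_in = visible_points(points, (0, 99), (15, 19))  # 500 points, all shown
```

In a Dash app, the viewport ranges would typically come from the graph's relayout data when the user zooms or pans.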

TODO:

  • Add other preferences like filtering max emoji version.
  • Enhance the encoded text for emoji? A person with a laptop is called a "technologist"; with a better description, the search would be better at finding it. I'd need some alternate description info, however, which isn't in the Python emoji library.