Semantic search with embeddings: index anything

awesome-semantic-search

In Semantic search with embeddings, I described how to build semantic search systems (also called neural search). These systems are used more and more as indexing techniques improve and representation learning gets better every year with new deep learning papers. The Medium post explains how to build them, and this list references interesting resources on the topic so that anyone can quickly start building such systems.

  • Tutorials explain in depth how to build semantic search systems
  • Good datasets to build semantic search systems
    • Tensorflow datasets: building search systems only requires images or text, so many tf datasets are interesting in that regard
    • Torchvision datasets: datasets provided for vision are also interesting for this
  • Pretrained encoders make it possible to quickly build a new system without training
    • Vision+Language
      • Clip encodes images and text in the same space
    • Image
      • Efficientnet b0 is a simple way to encode images
      • Dino is an encoder trained with self-supervision that reaches high knn classification performance
      • Face embeddings: compute face embeddings
    • Text
      • Labse: a BERT text encoder trained for similarity that puts sentences from 109 languages in the same space
    • Misc
      • Jina examples provide examples of how to use pretrained encoders to build search systems
      • Vectorhub: image, text, and audio encoders
  • Similarity learning allows you to build new similarity encoders
  • Indexing and approximate knn: indexing makes it possible to create small indices encoding millions of embeddings that can be queried in milliseconds
    • Faiss: many aknn algorithms (ivf, hnsw, flat, gpu, …) in C++ with a Python interface
    • Autofaiss: to use faiss easily
    • Nmslib: a fast implementation of hnsw
    • Annoy: an aknn algorithm by Spotify
    • Scann: an aknn algorithm by Google, faster than hnsw
    • Catalyzer: training the quantizer with backpropagation
    • hora: approximate knn implemented in Rust
  • Search pipelines allow fast serving and customization of how the indices are queried
    • Milvus: end-to-end similarity engine, built on top of faiss and hnswlib
    • Jina: flexible end-to-end similarity engine
    • Haystack: question answering pipeline on text
  • Companies: many companies are being built around semantic search systems
    • Jina is building flexible pipelines to encode and search with embeddings
    • Weaviate is building a cloud-native vector search engine
    • Pinecone is a startup building databases that index embeddings
    • Vector ai is building an encoder hub
    • Milvus builds an end-to-end open source semantic search system
    • FeatureForm's embeddinghub combines a DB and KNN
    • vespa: a knn-based managed retrieval engine
    • Many other companies are using these systems and releasing open tools along the way, and the list would be too long to include here (for example Facebook with faiss and self supervision, Google with scann and thousands of papers, Microsoft with sptag, Spotify with annoy, Criteo with rsvd, deepr, autofaiss, …)
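
To make the encode, index, query flow behind the resources above more concrete, here is a minimal sketch of the simplest kind of index, an exact ("flat") search over normalized embeddings. It uses only the Python standard library. The `FlatIndex` class, the `normalize` helper, and the 3-dimensional vectors are all hypothetical and made up for illustration; they are not the API of faiss or of any other library in this list, and in a real system the vectors would come from a pretrained encoder.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class FlatIndex:
    """Hypothetical exact (brute-force) knn index; not the API of any library."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, item_id, vec):
        # Store normalized vectors once so each query only needs dot products.
        self.ids.append(item_id)
        self.vectors.append(normalize(vec))

    def search(self, query, k=3):
        q = normalize(query)
        scored = (
            (sum(a * b for a, b in zip(q, v)), item_id)
            for v, item_id in zip(self.vectors, self.ids)
        )
        # Keep the k highest cosine similarities, best first.
        return [(item_id, score) for score, item_id in heapq.nlargest(k, scored)]


# Made-up 3-dimensional "embeddings"; a real system would get these from an encoder.
index = FlatIndex()
index.add("cat photo", [0.9, 0.1, 0.0])
index.add("dog photo", [0.8, 0.3, 0.1])
index.add("car photo", [0.0, 0.2, 0.9])

# Pretend this is the embedding of a text query about a kitten.
results = index.search([1.0, 0.2, 0.0], k=2)
for item_id, score in results:
    print(f"{item_id}: {score:.3f}")
```

A flat index compares the query against every stored vector, so its cost grows linearly with the collection. Approximate knn libraries avoid that full scan, typically trading a little recall for much faster queries and smaller indices.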