henomis/lingoose

v0.0.11

Closed this issue · 4 comments

  • fix README example #117
  • change logo and refactor README #118
  • custom HTTP client support for Huggingface API #122
  • fix multiple load() in simpleVectorIndex #123
  • fix huggingface llm verbosity #124
  • fix directoryLoader validator #119
  • Refactor simpleVectorIndex internal structure: see here #126
  • indexes must be able to work with raw data structure not linked to documents. #127
  • refactor indexes methods Search and Query. #128
  • add a method to append a vector to an index #129
  • implement index retriever #130
  • implement cache #131
  • retriever implementation has the following issues: #132
    • it is strictly tied to documents, whereas a retriever should be a helper for accessing an index.
    • who is in charge of loading documents: the index or the retriever?
    • remove retriever
  • lint code #133
  • Add new QA pipeline mode refine #134
  • Fix: indexes may confuse cosine distance with cosine similarity #135
  • Update docs

SimpleVectorIndex refactor

The internal structure should be like this:

type data struct {
	ID       string     `json:"id"`
	Metadata types.Meta `json:"metadata"`
	Embedding embedder.Embedding `json:"embedding"`
}

Redis integration

create index

"FT.CREATE" "idx" "SCHEMA" "item_keyword_vector" "VECTOR" "FLAT" "10" "TYPE" "FLOAT32" "DIM" "768" "DISTANCE_METRIC" "COSINE" "INITIAL_CAP" "1000" "BLOCK_SIZE" "1000" "product_type" "TAG" "SEPARATOR" "," "item_name" "TEXT" "WEIGHT" "1.0" "item_keywords" "TEXT" "WEIGHT" "1.0" "country" "TAG" "SEPARATOR" ","

insert vector with metadata

"HSET" "product:pippo:pippo" "primary_key" "pippo" "product_type" "bike" "item_name" "bike" "item_keywords" "bike" "country" "Italy" "item_keyword_vector" "\xed$C?\xc9\xce\xf0>\x1a\xf1\xea>?\x81H?T\x95\xea>\\\x06\x1c?\t!p?\x11\x18\xdb>\xc2\xd1\x96>\x0e\x8di<\xdcn\x0f?\x9b\xaa\xb8<\x85=\x11?\xef\xf4\xc9>\x81R\xca=:1\xce;\x0e\xa8\xae>\xba\xdf\x86>B\x80,?\xd0\x81h>\x19\x88\x80>\x01\x1c\x01?\xe8\x12\xa9=\xa3\x00\x0b>Xo\xc5>5\xdd.?\x02g\xf7>\xd3\x01f>\x02\xd4b>\x91\x16\x1e>\xb2)\x12>eHN?\xf0\xa1(?H\xe4\xca>\x91i\x1f?\x9d\xfe\x8e>\xc2\x0b\xaf>\x95zQ?\x9f\xd4R?&\xe3\a?\xd3\x00`?\x00{5?HK+?7=\x95>F\x9c\xe6>\x9c\xa6m>\xbd\x88\xa9>\xe2\xbeY?&\xceR>8uj>'7x?\x95\x98\xf6>\x03\xe6\xba=\x8a*j?\xcf\xbd'?\xc1\x89\n?\x14\x91P?\xe0\xb8g?d\x164?\xa6b\xb3>\xc8\x1e\x97=['\x01?\xe1\xef3?\x9d\x85\\=X\x0cD=\xb1\xd6\x8e>\xa8\x1f3?*J\x1b?\xf9\xb5\x8b=)\x93\xbc>`\x85d?\xc9\x93n>\xf1\a\x8d>\xd1r\xd0>\x9c\\U?/U%?XVu=\xf2\x15 ?U\xd0b?\x00\x00\x14?\x19/\xa0>!\xaeS?\x92\x98O=nX%?L\x18T?\x02n\xac>y\xf8\xf5>\x1e\x10Y>*{\a=wLP?\x16\"\x9c=\xc7~\xfb=3\x1d\xf7>R\x85\xcd>k\x1d\xca>s\x88\xd3=\x83w\xf0>\xc0Y\x19?{\x9b\xee>\xe8\x05r?1jK?\xd2.1?\x0e\xdc5?\x9c\xc8.?\xd4\x99J?\xc6\xf1\x03?\xb4\xf5(>k\x94W?\xeb\x84)?!\x84q?dA\xc9=\xd8\x13\xd4>\xb9\x84H?ym\xde>\xea\xcf`?\x1a-*?\"h\xd7>\xe76n=;{/?\x13/\xf7;\t\xfc.?\xe1F!?w_\r?-\xe4U>dt\xe6>JZ\x12>*\xb3\x0b?Hq(?1~r?\x80A\xdf>UQ\x16?\x8f\"s?f\xdd]>\xb3gY?\xb6\xa6#>\xfc\xcb\x1d?$\xd2e?\x06\xf6\x02>\x91H\xf6>\x94\x90\xd0>O\xbcK?\xff\xeb1?\xa2K>?\x99\xee\xa1=\x1e\x7f??\x0c\x13\xcf>\xa9E\x82>\x91\xe6\x84=%\xb4\n?\xd8=^?\xa3PJ?\xfc\xdb|?\x8eVY>T^\x84>\x86\x1d\x05?&\xf9U>\x18#~?\x1b\xb5\x05?\x84x<>\xe8\xb8W?\xf12\xee>@sz=\x9d$\x0c>\xc3\xdej?\x99L\x7f>\x9e\xc4\x18?1\xa9F?\xa7\x9c\xbe>\x06\r\xcd=)=\xce=b3\xe1>1*\xee>8O\x89>\xf7\x0b@?\xf7\xd9R?N\x1c0>\x96\xb3\xd5>=n\xb1>\xfa\xde\t?7\x90s?\xed\x18R=\xf6f#?\xb9\xcfe?\xe9\x9a3?\xce\xbbe?!\x9e\xa3>\xec\x9c\x11>\x96[u?\xe7\x14u>\x8c\x86K>A\x85N>\x90\xd0\xa4>\x93\x17\x12?\xdcrg?}\xfb\a??m\x0c?=\xe0R?\x94\x11i?\xc6\x8e\x98>\xa2\x14
\xf9>_\xa6\xf2>\xcf\x0e\xf3>`\xe5_>\xa5\x0bp?>\x9c\x98>\xff\xe2\x8f>\t\x0bO>x5\x03?'\xef\xd4>\x80\x14K?\x84\x89\xd6=\x86\xd1.?\xb8\xcf\xa9=\x81^e?l\xe8\xd2>}\x84\xb9>\x91x[=7\xe4+?\xddb\x85>\xa1\x8c\b>\xcf\xcb\xc5<\xf2\xcb\x04?\x9e\xf0\x9c>\x88W^?\x03jv?\xb7\x0b\xc1=\xe2\x03m?\xca\x1d>?\xea\x01\r?BO\x85>\x91\x15\x9e>\xf7\xa3\xf1>\x15\x9dJ?3\xed\t?\xee\x19\x9f>?@\xd9<\xa5\x98\x13>\xed\xb9\x05?\xd6e\xc6=y\x13\xc7>O\x87\xce>\x01\xb7\xc8>W\x99/?\xfc\xa0\x9e>\xff]\xeb>\xce\x92=>i\xb6~>\x1b\xef\xe6;\x8b\xa8\xb0>\xd6Q\x1b?\xa13P?|\xf5/>QrT>\xbf\x18\x06=\x01\xf3 ?\x8433?`@y?\xcd\xb6\xcb>\x03.O>\xe8\x12p>\xcdZz?\xee\xdb~?'76?\x9foC?r\xd0\x1a?\xe8\x8f)?\xa8\xfd\x94>\xf1\xe3D?Y\xc0\x02?\xd4WX?\x82v ?\xd3\x83\x16=\xe9\xb8!>\xee%3?l\xc6\x85>\x17\xeb\xf0=)f\x1e?\x0b~\x17?J\xe9\x9e=.V%?\xfd\xbb\xfc>\x98\xa4\xc4>\xc3\xf0\x1b?\x96\xaa\xa5>\x19E\xfb>\xf3\x95G?\xf8[\xd7>\x97\xdf\x82>B\xd8\x8d>\x86\xac\xa3>g\xe7\x9d>\x18\xbeo?;T\xbc>\x0f\x9af?\xee{\xbf>\x94h\xca>\xad\x1eP?Co\xb0>{\xae-?,\xb6\xd1>l\x94\xed>u>\xc0>\xdd\xb6n?[\xb2}>\xc9\x9au?-\x7f]?JE.?\xa2\x16\r>HSk?\xca\x9an?\xaf\xd3i?\xae\xcdD?x\x0c\x10?R\xfd\"?Y\xd0\xd3>\x18]1?\xf2\xa4\x96>JK\xcb=\x82\x87$?\xebgZ?Okr?\x8b\x16'?\x94<@>\xab\xfe%?\xce+\x02?\xb9\xf0\x9c=\x17x=>yP|?\xfe\xb8\x0c?\xe9+\xb0>\xf0E\x15?-\x8b\x1d?\x89&h?\x919\x06>c9\x12>\r\xf0J?\x19\x9fm?\xfbR\xd3=r\x84>?A\xeeL?\x9f\xc94?\xd9\xe3V?e^E?r\xc8\xe8>o\xcbb?X\xf8\x83>\xe8\xcbZ?8\x06e>\xa0\x9e\xe8>\x99\x0e\xcd>\xaf\x00@=\xb9\xbbc?C\a\x1d>R\\9?6\xad\\?\x1e\x13\xbe>8\xabW?\x90q\x0b?\xd1\x04B?\x98\"\xd0>\xcc\xf4\xb1=\x06\xccE?U\xea\x04?\x8f\xdb\x9e=\\k\x02?\xb9\xe8\xcb>\xb8s\xc7>\xf5\x83\xed>\xa3\x01\xb5<\x06mt?<@\xb2<*E\xe6>Yc\xe1>W\x994>\tG\x0f?6\x9ev?\x86l\x03?\xf6\xbc\x1e?3M,>\xb2\xd1\x8f>\xf8\xa1\a?\xd5\xcbm>\xd6Wz?\xaf\xbc/?\xa6%\n?\xa3Sw?\xc8$\x03<\x04\x9f\x1d?\xc9\xdb\xb5>:\x90\x0e>\xd3\xce\xf8>,cV?\x1a\xf3\x1e?\x00\x1c\xb8>\xe2\xba\xb9>\xd4Oj>\xcd\xebq?b\x8eB?\xed4\x1c?\xa3\xe3X?6\xd2\x9f>\xf6z\x89>\x95\xb6\xa2>(,Y?n\xe2\xec<\xf8\x90M?\x91\x04P?\x8fR\x9c=\xff\x8b
a?7\xeeb?F\x02&?Q\x9dS>\xcd\xefb?\xab\x80\x04?\x89\x91\xdd>\xfe\xa7\xf2=\x15\x0e[?)GG?pB\x12?W\x0b2?+\xc1w>\x99\xa8\xd7>\x88hn?U\xfa~?\x16\x93\xad=\x94,\x00>z$)=\x01\xb3\b?]7\x02?\x8d\x84\x8c<e\xcb^?\xc1\xc4_?.\xc6$?\xe3\x8d >'/\x18?\x13\x93,>\x0e`\t?%\xf9/?\x12\x80\x1b?G\x82\xe7>\x15\xba\x0b?|h\x0e>!o\x02?\n\xefY?\x9c\x0b\n>GUz=\xe5=n?\xda\xa0a>\xe0%7?M\xbf\xf7>\x9a\xefA?\xfe\xfbX?~\xf9\x0b?z\r\xf6>\xe1qm?\t\xce\xbe>\x95\xa2\x13?\xe0\xb1V?e\x9cH>\xcd\xe1\xe9>\xd0\xbcZ?\\\xad\x17>\x9c\xeb\xe0>\xed\x8fm?}D8?)\xf9\xea>\xce\xff\x97;@_\x12?\xf3\xf7>?qrj=\x0b\xb2\xc2>/%\r?\xfe}\x17>\xf5\x9d\x15?0\xf8\xdf>\xfc\r)?\x85\x89K?\xb1</>le\xde>\xfb_\x86>\xc2\x0b\x05?/\xcd\x1b?\xd3\x02\xc2>\xad\xd8\xa0>\xc5\xc2\x81<\xdbW\xf7>\xcbH\xb5>\xbfs@>\xee\x1do?\xe0h/?s\xf6\xc3>\xcd\xcf\x90>{\x8a\x19>E\xe2\x0e>\x154\xab>YXV?\x8b\xcb\xac>;'\x13?\xb2\xb5\xe3>\x99J\xfa>\xd9\x806>{ob?Y\xd6\xdf=JZ\x81>[\tE?\xc4\xe2\xa7>g\x90\x8a=\x95\xa1\x1f>\x9b\x82\xe2>,\n8>\xa7\xfcr?x\xe5>>\xd7\x9b'?\x16nQ?)P\xcc>\x9e@\xe3>4\xa5\x13>\xc4\x8a\x1e>6FF?\xe6\xd4[?\x88Vw;\xc2G\b?\xb2\xa4H=\n\xbc\xd2>\xcd%q?\x85F\x1e?\xe8u\xb3>\x1d\xeaL>\xa6\x96\x98>\x97\xdd&?\xae!+?\xe9\x99\\?\x16\x8a\x18?qNW?I\x1b\xa9>\xf5&C=:\xc3\x02?U\x9cW?c\xb1$?\xc7w9?\xdcA\x9f>\xd5\x11\x1e>\x97\xb4y?a\x9c\"?;\xcb)?\xd8\xc5\xf6>\xe8\xe8\xf3>7\x1a^?\xb4N6=\xf4Qa?o\x82\xcb>&y%?\xdfI3?Q3\x04>^\x9d\xf0>\xbeF\x00?Z\xcd\x06?T\xbc]? 
o\x0e?\xe7F\xe5>\"p1>\x88y\xad>\xa9\xc4O?\xe3\x0c\xcb>\x12V\x0e?\xfdu??>\xd2:>,\x80F?\xaeUa?\xcdSL?D@S>m\x9c >\xe9\x80\xf4>\x10\x80n>\x15\x7fE?4\x90@?\xc5\xc5.?\xfa\"D?4Ed>\xe1\x0f\xa0=\x8a}\x1e>\x98\xdeZ?t\xd0O?\xd8A6?\xe4\x9e[=\xe1Z(??&=>\xd9\x00E=SD,>W\x87c?\x1a\x17\x04?\"5\xc7>)\tM?~\xd5$?\x11Y\xfe=w\xd7\x10>\x99<\xd9=\x86r\xee>D\xa1\xc2<\xbb\xe6\x86>|B\x8f>\x13\xedB?\xb4t\x96>\x8c3\xc0>\xfbx\xf2=\x9aX\xe9>`\x06\xdf>Lc7?\xb9\xad\x18?\x90\xe3q?\xcfs\xad>\xca>{?\x94\xd72?\x8f&\\?\xc9/9?\x98\xdd_>\a\x9a5?7\x87c?R\xdd\xf2=%6->\xc0z\xa9<\x8b\x0b;=\xdd\xc1\xd7>:\xeeu?jM*?X\xed\\?}Q0=\x84\xf1z?I\xef\xbf=+\x00\x8d>\xae\xe1\xc8<\xdd\xd5-?\xa16\xae=\xb7\xb8h?\x19n\xd3>\x97Gc?\xc0\xbf\xed>q\x8f\xaf>W\x00!<E?\x9b>\xef\xba\xd5>\xd1\xd8\xcc=J\xa7\xc3>\xc0\xf97<\xbb\x06\x83=h\x8fP>\xce\a\xe0=g&\xa0>)\x93T?@V*?\xc6\xc1I?\xce\xe4\xb7>c&N?<\xfa\x0b>\b\x11\x1b?m\xadD>\x9c\xdeC?N\xa4\xa7>j`\xcc>D\xf4:?f{\x13?\x0f\xdb$?|\xc8m?b\xccP?z\a\x1d?\x0cJp?'\x97\xc0>\bd\x7f?\xce\x1b\x8b=w\xa2E?gO\x0e?0\xeb\xae>\x84K\x89>{Yg?\xff~/=\xd5\a\x8e=r\xaay?\xfa\xcd\x1d?\x12W]?\xea\xa5\x15?*{\x1a?\x83c\xe8>b\x84\x0e?a\xf4\x1b?\x1c\xcb\x1d?d\x82t=\xff\xba\x85>\xa2\xce\xca>\xe6\b\x16?\x1b\xc0&?\x8e\x7f\xe9>\xb8\x96N=\xc1[B?\xd3b\x14>.\x810?\x13\x83S?Mri?W\xc8]?\x9e\x9d_=}OT=\xe2]6>C\xb8\x98=\xf8\x13\x99>y\xed,?\x94\xbb\x15?\xae\xb7\xb0> \xf0S?V\xcdG?\x16K\xf9>\x8f\xf7X>\x11v\r>N\x861?\xe5\xca\x94>KcO?\xf4\xb2 ?\xdd\x0b\x7f?,F\x97=\xb7)\xfe>\x1e\xa4H?\x8e\xd6V?\x0e\xbfG?\x8a\xa2p?oF7?1H\xec>\xc3\x1a:>J\xe55=\x98\x0eA?%t7?\xe8\x97\xe2=$\x15V?\x80\x9c\x0f??0%?\x93\x83]>Z\xb8j>\xb8\x0c\xf6>D\xb0m?M~L?\x8a\xc9\xd5=\xaf\xa5\xa2>\x0eEp?\x96O'>\xbd\x177?0R\xec>4\xeb\xcf>\x81\x15\xfb>\xba\x82\x11?,\x1d\xfb>\xab\x17\xe9>\xe2Q\x1a?o\x8aO?<\xa4K?\x89\xe1=?\x01\x96\x80>\x11*\x05?\xcaY^?\"z\x8b=\xafe\x1f?\x1a;k?~_\xc8>\xcf\xd9\xe8>\xd9J\x14>\xcc\xdf&?\x87\xe7~?\xaa\xa2\x05?\xf7}Y?\xd1\xd4\xa2>\xa9z\x9f>\xc7\xeeW?" 

search with vector

"FT.SEARCH" "idx" "*=>[KNN 1 @item_keyword_vector $vec_param AS vector_score]" "RETURN" "3" "vector_score" "item_name" "item_keywords" "SORTBY" "vector_score" "ASC" "DIALECT" "2" "LIMIT" "0" "1" "params" "2" "vec_param" "\xcd\xcc\xcc=\xcd\xccL>\x9a\x99\x99>\xcd\xcc\xcc>"

Go string vector encoding

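a := []float32{0.1, 0.2, 0.3}
buf := new(bytes.Buffer)
_ = binary.Write(buf, binary.LittleEndian, a) // needs "bytes" and "encoding/binary"; packs raw little-endian FLOAT32 bytes
e := fmt.Sprintf("%q", buf.String()) // quoted form, as shown in the commands above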
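
The quoted blobs in the commands above appear to be raw little-endian float32 bytes. The following is a minimal Python sketch (standard library only; the helper names are made up for illustration) that reproduces the `vec_param` value from the `FT.SEARCH` example for the query vector `[0.1, 0.2, 0.3, 0.4]`:

```python
import struct


def encode_float32_vector(values):
    """Pack floats as consecutive little-endian IEEE 754 float32 values."""
    return struct.pack(f"<{len(values)}f", *values)


def decode_float32_vector(blob):
    """Inverse of encode_float32_vector: 4 bytes per component."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


query_blob = encode_float32_vector([0.1, 0.2, 0.3, 0.4])
print(query_blob)
# b'\xcd\xcc\xcc=\xcd\xccL>\x9a\x99\x99>\xcd\xcc\xcc>'
```

With `DIM 768`, the blob stored by `HSET` would be 768 × 4 = 3072 bytes. Whatever encoding the Go side uses would need to produce the same byte layout.
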
import json
import time

import numpy as np
import pandas as pd
import redis
import requests
from redis.commands.search.field import (
    NumericField,
    TagField,
    TextField,
    VectorField,
)
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query
# from sentence_transformers import SentenceTransformer

url = "https://raw.githubusercontent.com/bsbodden/redis_vss_getting_started/main/data/bikes.json"
response = requests.get(url)
bikes = response.json()


client = redis.Redis(host="localhost", port=6379, decode_responses=True)

res = client.ping()
# >>> True

# WARNING: flushall() deletes every key in the Redis instance; use a scratch instance only
client.flushall()

pipeline = client.pipeline()
for i, bike in enumerate(bikes, start=1):
    redis_key = f"bikes:{i:03}"
    pipeline.json().set(redis_key, "$", bike)
res = pipeline.execute()
# >>> [True, True, True, True, True, True, True, True, True, True, True]

res = client.json().get("bikes:001", "$.model")
# >>> ['Summit']

keys = sorted(client.keys("bikes:*"))
# >>> ['bikes:001', 'bikes:002', ..., 'bikes:011']

descriptions = client.json().mget(keys, "$.description")
descriptions = [item for sublist in descriptions for item in sublist]
# embedder = SentenceTransformer("msmarco-distilbert-base-v4")
# embeddings = embedder.encode(descriptions).astype(np.float32).tolist()
embeddings = np.random.rand(len(bikes), 4).astype(np.float32).tolist()
VECTOR_DIMENSION = len(embeddings[0])
# >>> 4 with the random placeholder above (768 with the commented-out msmarco-distilbert-base-v4 embedder)

pipeline = client.pipeline()
for key, embedding in zip(keys, embeddings):
    pipeline.json().set(key, "$.description_embeddings", embedding)
pipeline.execute()
# >>> [True, True, True, True, True, True, True, True, True, True, True]

res = client.json().get("bikes:001")


# >>>
# {
#   "model": "Summit",
#   "brand": "nHill",
#   "price": 1200,
#   "type": "Mountain Bike",
#   "specs": {
#     "material": "alloy",
#     "weight": "11.3"
#   },
#   "description": "This budget mountain bike from nHill performs well..."
#   "description_embeddings": [
#     -0.538114607334137,
#     -0.49465855956077576,
#     -0.025176964700222015,
#     ...
#   ]
# }

schema = (
    TextField("$.model", no_stem=True, as_name="model"),
    TextField("$.brand", no_stem=True, as_name="brand"),
    NumericField("$.price", as_name="price"),
    TagField("$.type", as_name="type"),
    TextField("$.description", as_name="description"),
    VectorField(
        "$.description_embeddings",
        "FLAT",
        {
            "TYPE": "FLOAT32",
            "DIM": VECTOR_DIMENSION,
            "DISTANCE_METRIC": "COSINE",
        },
        as_name="vector",
    ),
)
definition = IndexDefinition(prefix=["bikes:"], index_type=IndexType.JSON)
res = client.ft("idx:bikes_vss").create_index(
    fields=schema, definition=definition
)
# >>> 'OK'


query = (
    Query('(*)=>[KNN 3 @vector $query_vector AS vector_score]')
    .sort_by('vector_score')
    .return_fields('vector_score', 'id', 'brand', 'model', 'description')
    .dialect(2)
)

encoded_query = [0.1, 0.2, 0.3, 0.4]

docs = client.ft("idx:bikes_vss").search(query, {'query_vector': np.array(
    encoded_query, dtype=np.float32).tobytes()}).docs

print(docs)

Metrics

Value ranges of the metrics commonly used to compare vectors:

  1. Cosine Similarity:

    • Range: [-1, 1]
    • Explanation: Cosine similarity measures the cosine of the angle between two vectors and can take values between -1 and 1.
    • A value of 1 indicates that the two vectors point in the same direction (their magnitudes may still differ).
    • A value of -1 indicates that the two vectors are diametrically opposed, pointing in opposite directions.
    • A value of 0 indicates that the two vectors are orthogonal, meaning they are perpendicular to each other.
  2. Dot Product:

    • Range: (-∞, +∞)
    • Explanation: The dot product is the scalar product of two vectors. It is unbounded and can be positive, negative, or zero.
    • For unit-length vectors, the dot product equals the cosine similarity.
  3. Euclidean Distance:

    • Range: [0, +∞)
    • Explanation: Euclidean distance measures the straight-line distance between two points in Euclidean space. It is always a non-negative value.
    • When the two points are identical, the Euclidean distance is 0.
    • As the points move further apart, the distance grows without bound.

These are the general ranges. How a score is interpreted depends on the context, in particular on whether an index reports a similarity (higher means closer) or a distance (lower means closer).
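
To make the ranges concrete, and to illustrate the similarity/distance mix-up tracked in #135, here is a small pure-Python sketch. The function names are illustrative and are not part of lingoose. It assumes the common definition of cosine distance as `1 - cosine similarity`, which has range [0, 2], and non-zero input vectors.

```python
import math


def dot(a, b):
    """Dot product: unbounded, can be negative, zero, or positive."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b, in [-1, 1]; higher means closer."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def cosine_distance(a, b):
    """Assumed definition: 1 - cosine similarity, in [0, 2]; lower means closer."""
    return 1.0 - cosine_similarity(a, b)


def euclidean_distance(a, b):
    """Straight-line distance, in [0, +inf); lower means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # different lengths
opposite = cosine_similarity([1.0, 0.0], [-3.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 5.0])
```

Sorting results by ascending score, as the `SORTBY vector_score ASC` query above does, only makes sense for a distance. For a similarity, the order would have to be descending.
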