guenthermi/postgres-word2vec

Using knn search when the input is a Python list (representing the embedding)

igormis opened this issue · 8 comments

I am having problems passing an embedding to the `knn_in_pq` function.
I tried:

```python
embedding = [0.12109375, 0.056640625, ..., -0.2421875]
embedding = np.array(embedding)
cursor.execute("SELECT * FROM knn_in_pq((%s), 2, ARRAY(SELECT event_name FROM events));", (embedding,))
```
It throws:

```
function knn_in_pq(record, integer, character varying[]) does not exist
LINE 1: SELECT * FROM knn_in_pq(((0.12109375, 0.056640625, -0.242187...
```

When I try with the list:

```python
embedding = [0.12109375, 0.056640625, ..., -0.2421875]
cursor.execute("SELECT * FROM knn_in_pq((%s), 2, ARRAY(SELECT event_name FROM events));", (embedding,))
```

it throws:

```
function knn_in_pq(numeric[], integer, character varying[]) does not exist
LINE 1: SELECT * FROM knn_in_pq((ARRAY[0.12109375,0.056640625, -0.24...
```

From the \df command I can see that `knn_in_pq` is overloaded; here are the possible signatures:
```
public | knn_in_pq | TABLE(word character varying, similarity real) | query_vector anyarray, k integer, input_set integer[]             | normal
public | knn_in_pq | TABLE(word character varying, similarity real) | query_vector bytea, k integer, input_set character varying[]      | normal
public | knn_in_pq | TABLE(word character varying, similarity real) | token character varying, k integer, input_set character varying[] | normal
public | knn_in_pq | TABLE(word character varying, similarity real) | token character varying, k integer, input_set integer[]
```
I know that the embedding needs to be cast somehow to `query_vector bytea` (I think), but I do not know how to do that. Can you help me with this?
If I send a word instead of the vector, the function works properly, but I need to send a vector and get the k closest events to the input vector.

Or maybe I could use the overload with `query_vector anyarray`, but I still have problems converting a numpy array or Python list to `query_vector anyarray`...

The idea is, given a new vector, to find the words/event_names (or movie titles, for that matter) that are closest to it.

I have also tried this code sequence:

```python
buying = [0.12109375, 0.056640625, ... -0.016479492]  # 300d
cursor.execute("SELECT * FROM knn_in_pq(vec_to_bytea(%s), 2, ARRAY(SELECT event_name FROM events));", [buying])
```

It returns:

```
Unknown element type: 1700
```

(1700 is the PostgreSQL type OID of `numeric`, which is how the Python floats arrive without an explicit cast — hence the `float4[]` cast in the next attempt.)

Finally, I managed to solve the problem using the following code sequence:

```python
list_ = [0.1234, 0.43453, ... -0.02123]
cursor.execute("SELECT * FROM k_nearest_neighbour_ivfadc(vec_to_bytea(%s::float4[]), 10)", (list_,))
```
However, I still have issues with:

```python
cursor.execute("SELECT * FROM k_nearest_neighbour_pq_pv(%s, 2)", [list_])
```

where I cannot use `vec_to_bytea`, as the function argument is anyarray. The same function works when I use the varchar version, i.e. this works:

```python
cursor.execute("SELECT * FROM k_nearest_neighbour_pq_pv((%s), 3)", [string])
```

Can you help me with the version where I need to send a Python list (representing the embedding)? Here is the error:

```
too few arguments for format()
CONTEXT: PL/pgSQL function k_nearest_neighbour_pq_pv(anyarray,integer) line 8 at RETURN QUERY

b'SELECT * FROM k_nearest_neighbour_pq_pv((ARRAY[0.1234, 0.43453, ... -0.02123]), 2)'
```

I made changes to the following procedure (my changes are marked with `-- changed` comments; I think one `input_vector` argument was missing from the format() call):

```sql
-- TODO ADAPT
-- with postverification
CREATE OR REPLACE FUNCTION k_nearest_neighbour_pq_pv(input_vector anyarray, k integer)
RETURNS TABLE (word varchar(100), similarity float4) AS $$
DECLARE
  pq_quantization_name varchar;
  post_verif integer;
BEGIN
  EXECUTE 'SELECT get_vecs_name_pq_quantization()' INTO pq_quantization_name;
  EXECUTE 'SELECT get_pvf()' INTO post_verif;
  RETURN QUERY EXECUTE format('
    SELECT pqs.word AS word, cosine_similarity_bytea(vec_to_bytea(''%s''::float4[]), pqs.word) AS similarity -- changed: added vec_to_bytea
    FROM pq_search(vec_to_bytea(''%s''::float4[]), %s) AS (idx integer, distance float4)                     -- changed: added vec_to_bytea
    INNER JOIN %s AS pqs ON idx = pqs.id
    ORDER BY cosine_similarity_bytea(vec_to_bytea(''%s''::float4[]), pqs.word) DESC                          -- changed: added vec_to_bytea
    FETCH FIRST %s ROWS ONLY
  ', input_vector, input_vector, post_verif*k, pq_quantization_name, input_vector, k); -- changed: added the second input_vector argument
END
$$
LANGUAGE plpgsql;
```
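For reference, a minimal sketch of calling the adapted function from Python (same psycopg2 cursor as in the snippets above; the vector is shortened here and would be 300d in practice):

```python
# With the adapted function, the embedding can be passed as a plain Python
# list; psycopg2 renders it as an ARRAY[...] literal, and the explicit cast
# resolves the anyarray parameter to a concrete float4[] type.
list_ = [0.1234, 0.43453, -0.02123]  # full 300d vector in practice
cursor.execute("SELECT * FROM k_nearest_neighbour_pq_pv(%s::float4[], 2)", (list_,))
print(cursor.fetchall())
```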

Hi, yes, you need to cast the array to the bytea type. If you don't want to use the `vec_to_bytea` function, I think you can use `psycopg2.Binary(np.array(embedding, dtype='float32'))` in your Python script.
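A minimal sketch of both options (the connection string is a placeholder; that the bytea overload expects the raw bytes of a float32 array is an assumption here, so the server-side `vec_to_bytea` route is the safer one):

```python
import numpy as np
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection parameters
cursor = conn.cursor()

embedding = [0.12109375, 0.056640625, -0.2421875]  # full 300d vector in practice

# Option 1: cast to float4[] and let the server-side vec_to_bytea convert it.
cursor.execute(
    "SELECT * FROM knn_in_pq(vec_to_bytea(%s::float4[]), 2,"
    " ARRAY(SELECT event_name FROM events));",
    (embedding,))
print(cursor.fetchall())

# Option 2: serialize client-side; assumes the extension stores vectors as
# the raw bytes of a float32 array.
query_vec = psycopg2.Binary(np.array(embedding, dtype='float32').tobytes())
cursor.execute(
    "SELECT * FROM knn_in_pq(%s, 2, ARRAY(SELECT event_name FROM events));",
    (query_vec,))
print(cursor.fetchall())
```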

The `k_nearest_neighbour_pq_pv` function seems to be deprecated. Previously, I used the PostgreSQL array type for the vectors but then changed to the bytea type. It seems like I planned to update this function but never actually did, which would explain the "TODO" above it. Thank you for adapting the function.

@guenthermi yes, I have corrected all the pq_pv and ivfadc_pv functions. However, I was wondering about the index creation tables, like the ones created by pq_index.py and ivfadc.py: is there a way to update the index dynamically when I add a new word-to-vector mapping to the database? That is, when I add a new word and embedding, I would like to add the corresponding index entries to the existing tables instead of recreating the index tables.

You may use the insert_batch function. However, I don't know if this is applicable to your use case.
This function creates vectors for text values that are not in the vector table. To do so, it tokenizes the text values and determines the vector for each token. Then, it calculates the centroid of those vectors and adds it as the vector representation to the vector table. Furthermore, it calculates index entries for this new centroid vector and adds them to the indices.
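The centroid step can be pictured with a rough sketch (an illustration of the idea only, not the actual insert_batch implementation; tokenization and lookup are simplified):

```python
import numpy as np

def centroid_for_text(text, token_vectors):
    """Illustration of the insert_batch idea: tokenize a text value, look up
    the embedding of each token, and use the mean of those embeddings as the
    vector representation of the whole value.

    token_vectors is a dict mapping token -> np.ndarray; tokens without an
    embedding are skipped."""
    vecs = [token_vectors[t] for t in text.split() if t in token_vectors]
    if not vecs:
        return None  # no token has a vector representation
    return np.mean(vecs, axis=0).astype('float32')
```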
Maybe you can use this; however, it has some limitations:

  • The system was originally designed to store the complete set of word embedding vectors in a separate table next to the structured data in the database. Thus, it assumes that every text value either has a vector representation in the embedding table or that its tokens do. It does not consider the case of adding completely new vectors to the vector table (vectors that do not derive from a combination of already present ones).
  • The insert_batch function always updates all index structures. This might be a problem if you did not generate all indices. Moreover, it is not very fast.