Abraxas-365/langchain-rust

SurrealDB : ASSERT Error ( vector_dimensions )

Closed this issue · 6 comments

Hello,

When I use SurrealDB as a store, I always get this error when I add a document to the store:
field: Idiom([Field(Ident("embedding"))]), check: "(array::len($value) = 1536) OR (array::len($value) = 0)" })

    let store = StoreBuilder::new()
        .embedder(embedder)
        .db(db)
        .vector_dimensions(1536)
        .build()
        .await
        .unwrap();

    // Initialize the tables in the database. This only needs to be done once.
    store.initialize().await.unwrap();

    match store
        .add_documents(&docs, &VecStoreOptions::default().with_score_threshold(0.8))
        .await
    {
        Ok(_) => {}
        Err(e) => {
            println!("{:?}", e) // <---------------- here
        }
    };

The problem is here, in "surrealdb.rs":

async fn create_collection_table_if_not_exists(&self) -> Result<(), Box<dyn Error>> {
        if !self.schemafull {
            return Ok(());
        }

        let vector_dimensions = self.vector_dimensions;

        match &self.collection_table_name {
            Some(collection_table_name) => {
                self.db
                    .query(format!(
                        r#"
                            DEFINE TABLE {collection_table_name} SCHEMAFULL;
                            DEFINE FIELD text                      ON {collection_table_name} TYPE string;
                            DEFINE FIELD embedding                 ON {collection_table_name} TYPE array ASSERT (array::len($value) = {vector_dimensions}) || (array::len($value) = 0);
                            DEFINE FIELD embedding.*               ON {collection_table_name} TYPE float;
                            DEFINE FIELD metadata                  ON {collection_table_name} FLEXIBLE TYPE option<object>;"#
                    ))
                    .await?
                    .check()?;
            }
            None => {
                let collection_table_name = &self.collection_name;
                dbg!(&collection_table_name);
                self.db
                    .query(format!(
                        r#"
                            DEFINE TABLE {collection_table_name} SCHEMAFULL;
                            DEFINE FIELD text              ON {collection_table_name} TYPE string;
                            DEFINE FIELD embedding         ON {collection_table_name} TYPE array ASSERT (array::len($value) = {vector_dimensions}) || (array::len($value) = 0);
                            DEFINE FIELD embedding.*       ON {collection_table_name} TYPE float;
                            DEFINE FIELD metadata          ON {collection_table_name} FLEXIBLE TYPE option<object>;"#
                    ))
                    .await?
                    .check()?;
            }
        }

        Ok(())
    }

As a temporary workaround, simply remove the whole assertion:

ASSERT (array::len($value) = {vector_dimensions}) || (array::len($value) = 0)

and everything works properly.

Or do I need to find the right value for vector_dimensions (1536)?

I tried this, but it doesn't work:

    let docs = html_loader
        .load()
        .await
        .unwrap()
        .map(|x| x.unwrap())
        .collect::<Vec<_>>()
        .await;

    let size = docs[0].page_content.len(); // <---- not the right size
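
For context on why this gives the wrong number: `page_content.len()` is the byte length of the document text, whereas `vector_dimensions` is the length of the vector the embedding model returns, which is the same for every input. A minimal sketch of the difference, using a hypothetical `fake_embed` function as a stand-in for a real embedder (it is not part of langchain-rust):

```rust
/// Hypothetical stand-in for an embedding model (not a langchain-rust API).
/// Like a real model, it returns a vector of fixed length `dims`,
/// no matter how long the input text is.
fn fake_embed(text: &str, dims: usize) -> Vec<f64> {
    let mut v = vec![0.0; dims];
    if dims == 0 {
        return v;
    }
    // Fold the input bytes into the fixed-size vector.
    for (i, b) in text.bytes().enumerate() {
        v[i % dims] += f64::from(b) / 255.0;
    }
    v
}
```

With this stand-in, a 2-byte string and a 10,000-byte string both produce vectors of length 1536, so the text length says nothing about the dimension to configure.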

(Minor remark: I think vector_dimensions should take a usize parameter, not i32.)

Thank you in advance for your help.

Yes, you need to find the right size for the specific model. I added this assertion because the db may perform optimizations if the size is known in advance. Usually I let it run once; the error should say exactly what was expected, and I then change the size. Another option is to call embed_query() or embed_documents() manually and check the length of the returned vector.

One option would be to make the dimension size optional.
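
A rough sketch of what an optional dimension could look like when generating the field definition. The function name `embedding_field_definition` and the `Option<usize>` parameter are assumptions for illustration, not the crate's current API; the SurrealQL text mirrors the statement quoted earlier in this issue:

```rust
/// Hypothetical helper: builds the `DEFINE FIELD embedding` statement for `table`.
/// With `Some(n)` the length assertion from the issue is kept;
/// with `None` the assertion is left out entirely.
fn embedding_field_definition(table: &str, dims: Option<usize>) -> String {
    match dims {
        Some(n) => format!(
            "DEFINE FIELD embedding ON {table} TYPE array ASSERT (array::len($value) = {n}) || (array::len($value) = 0);"
        ),
        None => format!("DEFINE FIELD embedding ON {table} TYPE array;"),
    }
}
```

The trade-off is the one described above: without the assertion any array length is accepted, so the db can no longer rely on a known size.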

Makes sense to convert it to usize.

So we have to reimplement "add_documents()" for Store?

Why not expose a function in the Store that returns the size of the vectorization?

Or expose the actual vector usage in:

pub struct Store<C: Connection> {
    pub(crate) db: Surreal<C>,
    pub(crate) collection_name: String,
    pub(crate) collection_table_name: Option<String>,
    pub(crate) collection_metadata_key_name: Option<String>,
    pub(crate) vector_dimensions: i32,  //<<<<----------------------------- usize
    pub(crate) embedder: Arc<dyn Embedder>,
    pub(crate) schemafull: bool,

    pub vector_usage: usize, //<<<< !!!
}

and add this line to fn add_documents:

      self.vector_usage += vectors.iter().map(|v| v.len()).sum::<usize>();

like this:

#[async_trait]
impl<C: Connection> VectorStore for Store<C> {
    async fn add_documents(
        &self,
        docs: &[Document],
        opt: &VecStoreOptions,
    ) -> Result<Vec<String>, Box<dyn Error>> {
        let texts: Vec<String> = docs.iter().map(|d| d.page_content.clone()).collect();

        let embedder = opt.embedder.as_ref().unwrap_or(&self.embedder);

        let vectors = embedder.embed_documents(&texts).await?;

        // Add this (requires changing `&self` to `&mut self`):
        self.vector_usage += vectors.iter().map(|v| v.len()).sum::<usize>();
     
        ....
       ....

why not expose a function in the Store that returns the size of the vectorization?

I would expect the size to be static for a given table and known ahead of time, either as a const or coming from a config file or environment variables. How you implement get_vector_dimension_size is up to you.

async fn main() {
  let size = get_vector_dimension_size();

  let store = StoreBuilder::new()
    .embedder(embedder)
    .db(db)
    .vector_dimensions(size)
    .build()
    .await
    .unwrap();
}
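
One possible way to implement `get_vector_dimension_size`, as a minimal sketch. The environment variable name `VECTOR_DIMENSIONS`, the helper `parse_vector_dimensions`, and the fallback constant are all assumptions made for illustration; the `i32` return type matches what `vector_dimensions` takes in this thread:

```rust
/// Fallback when nothing is configured; 1536 is the value used in this issue.
const DEFAULT_VECTOR_DIMENSIONS: i32 = 1536;

/// Hypothetical helper: parses an optional raw setting and falls back to the
/// default when it is missing, not a number, or not positive.
fn parse_vector_dimensions(raw: Option<&str>) -> i32 {
    raw.and_then(|s| s.trim().parse::<i32>().ok())
        .filter(|n| *n > 0)
        .unwrap_or(DEFAULT_VECTOR_DIMENSIONS)
}

/// Reads the (assumed) `VECTOR_DIMENSIONS` environment variable.
fn get_vector_dimension_size() -> i32 {
    parse_vector_dimensions(std::env::var("VECTOR_DIMENSIONS").ok().as_deref())
}
```

Keeping the parsing separate from the environment lookup makes the fallback behaviour easy to check without touching the process environment.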

If you want to determine it on the fly, you can do the following. I would strongly advise against this, as you would be paying for unnecessary cost and compute. It might work for demo/POC purposes.

async fn main() {
  let openai = OpenAiEmbedder::default();
  let sample_vector = openai.embed_query("Sample embedding").await.unwrap();
  // The embedding dimension is simply the length of the returned vector.
  let vector_size = sample_vector.len() as i32; // vector_dimensions currently takes i32

  let store = StoreBuilder::new()
    .embedder(openai)
    .db(db)
    .vector_dimensions(vector_size)
    .build()
    .await
    .unwrap();
}

We could even go further and run this check before the db calls are made, so that a db call is not wasted, given that the vectors can be large depending on the embedding model you use.
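
Such a client-side check might look roughly like the sketch below. The function name `check_dimensions` and the `String` error type are assumptions for illustration; the rule it applies mirrors the ASSERT from the schema (exact length, or an empty array):

```rust
/// Hypothetical pre-flight check, run before any db call.
/// Mirrors the schema ASSERT: each vector must have exactly `expected`
/// elements or be empty. Returns an error naming the first offending vector.
fn check_dimensions(vectors: &[Vec<f64>], expected: usize) -> Result<(), String> {
    for (i, v) in vectors.iter().enumerate() {
        if v.len() != expected && !v.is_empty() {
            return Err(format!(
                "vector {i} has {} dimensions, expected {expected}",
                v.len()
            ));
        }
    }
    Ok(())
}
```

Unlike the db error quoted at the top of this issue, the message here includes the actual length, which is the number the user needs to configure.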

@prabirshrestha thank you for your suggestions.

Regarding the first suggestion: the assertion uses an "=" check, so the sizes must match exactly:

ASSERT (array::len($value) = {vector_dimensions}) || (array::len($value) = 0)

Why not use "<=" instead? It would be more flexible:

ASSERT (array::len($value) <= {vector_dimensions})

and we would no longer need:
|| (array::len($value) = 0)

The reason for exact dimensions is that the db may optimize indexing.

Closing this for now, as this is expected behavior.