amzn/pecos

Cannot replicate the XR-Linear performance with TF-IDF features and a fine-tuned embedding model

keshavgarg139 opened this issue

I am trying to replicate the XR-Linear results using TF-IDF features combined with pre-trained embeddings from the BGE model.
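For context, these are the imports the snippets below rely on (input_text_path and output_text_path are placeholders for my own dataset files):

import scipy.sparse

from pecos.utils import smat_util
from pecos.utils.featurization.text.preprocess import Preprocessor
from pecos.xmc import Indexer, LabelEmbeddingFactory
from pecos.xmc.xlinear.model import XLinearModel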

# Parse the raw data files into an instance corpus and a CSR label matrix
parsed_result = Preprocessor.load_data_from_file(input_text_path, output_text_path)
Y = parsed_result["label_matrix"]
corpus = parsed_result["corpus"]

# Fit a TF-IDF vectorizer on the corpus and featurize the same corpus
preprocessor = Preprocessor.train(corpus, {"type": "tfidf"})
tfidf_X = preprocessor.predict(corpus)

# Encode the same instances with a pre-trained BGE sentence encoder;
# data.query_string must be row-aligned with `corpus` above
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = model.encode(data.query_string.values, show_progress_bar=True, batch_size=2048)
print(embeddings.shape)  # (n_instances, 384) for bge-small-en-v1.5

I am concatenating the TF-IDF features with the BGE embeddings horizontally, as follows:

# Stack the sparse TF-IDF block and the dense BGE block side by side into one CSR matrix
X = scipy.sparse.csr_matrix(scipy.sparse.hstack((tfidf_X, embeddings)))
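
One thing I am not sure about is whether the two blocks should be L2-normalized separately before stacking, since the sparse TF-IDF rows and the raw BGE vectors can sit on quite different scales (I do not pass normalize_embeddings=True to encode). A minimal sketch of what I mean, assuming sklearn.preprocessing.normalize is available and that XLinearModel wants float32 CSR input:

import numpy as np
import scipy.sparse
from sklearn.preprocessing import normalize

# Row-wise L2-normalize each block separately so neither block dominates by scale
tfidf_norm = normalize(tfidf_X, norm="l2", axis=1)
dense_norm = normalize(embeddings.astype(np.float32), norm="l2", axis=1)

# Concatenate into a single float32 CSR matrix
X = scipy.sparse.hstack(
    (tfidf_norm, scipy.sparse.csr_matrix(dense_norm)),
    format="csr",
    dtype=np.float32,
)

The idea is only to keep both blocks on a comparable unit scale before XLinearModel sees them; I have not confirmed this is the recommended recipe.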

Model Training:

# Build PIFA label embeddings from the concatenated features, cluster the labels,
# and train the XR-Linear model on top of the cluster chain
label_feat = LabelEmbeddingFactory.create(Y, X, method="pifa")
cluster_chain = Indexer.gen(label_feat, nr_splits=4)
xlinear_model = XLinearModel.train(X, Y, C=cluster_chain, negative_sampling_scheme="tfn")

Prediction:

def process_query_and_predict(query, use_cpu_threads):
    # `query` should be a list of strings so both featurizers return 2-D outputs
    tfidf_vector = preprocessor.predict(query)
    bge_embedding = model.encode(query)
    # Build the query features exactly like the training features
    pred_X = scipy.sparse.csr_matrix(scipy.sparse.hstack((tfidf_vector, bge_embedding)))
    Y_pred = xlinear_model.predict(pred_X)
    # Sort the predicted labels by score for each query
    return smat_util.sorted_csr(Y_pred)
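
For completeness, this is roughly how I call it and how I score the model on a held-out split (X_test and Y_test are hypothetical matrices built the same way as the training features above, and the query text is just an example):

# Single-query sanity check
top_labels = process_query_and_predict(["red running shoes"], use_cpu_threads=True)

# Batch evaluation against the held-out split
Y_test_pred = xlinear_model.predict(X_test)
metrics = smat_util.Metrics.generate(Y_test, Y_test_pred, topk=10)
print(metrics)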

However, the model surprisingly performs quite poorly compared to the model trained on TF-IDF vectors alone.
I followed the PECOS XR-Linear tutorial and tried to replicate the process it describes, except that instead of AttnXML embeddings I want to bring in semantic capabilities via the BGE model.


Can someone share some insights on where I could be going wrong, and on how to properly merge the TF-IDF features with the BGE embeddings?

Thanks