Cannot replicate the XR-Linear performance with TF-IDF features and a Fine-tuned Embedding Model
keshavgarg139 opened this issue · 0 comments
keshavgarg139 commented
I am trying to replicate XR-Linear with TF-IDF features combined with pre-trained embeddings from the BGE model, as follows:
```python
from pecos.utils.featurization.text.preprocess import Preprocessor

# Parse the text/label files, then fit and apply the TF-IDF vectorizer.
parsed_result = Preprocessor.load_data_from_file(input_text_path, output_text_path)
Y = parsed_result["label_matrix"]
corpus = parsed_result["corpus"]
preprocessor = Preprocessor.train(corpus, {"type": "tfidf"})
tfidf_X = preprocessor.predict(corpus)
```
```python
from sentence_transformers import SentenceTransformer

# Dense semantic embeddings from the pre-trained BGE model;
# data.query_string holds the input texts (same order as corpus).
model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = model.encode(data.query_string.values, show_progress_bar=True, batch_size=2048)
print(embeddings.shape)
```
I concatenate the TF-IDF features with the BGE embeddings horizontally, as follows:
```python
import scipy.sparse

X = scipy.sparse.csr_matrix(scipy.sparse.hstack((tfidf_X, embeddings)))
```
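One thing I am not sure about is whether each feature block should be L2-normalized before stacking, since the sparse TF-IDF rows and the dense BGE rows live on different scales. A minimal sketch of that variant (the `normalize` calls are my own addition, not something from the PECOS tutorial):

```python
from sklearn.preprocessing import normalize
import scipy.sparse

# Assumption: L2-normalize each block row-wise before concatenation so both
# feature families contribute on a comparable scale.
tfidf_norm = normalize(tfidf_X, norm="l2", axis=1)    # stays sparse
bge_norm = normalize(embeddings, norm="l2", axis=1)   # dense numpy array
X = scipy.sparse.hstack((tfidf_norm, scipy.sparse.csr_matrix(bge_norm)), format="csr")
```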
Model Training:
```python
from pecos.xmc import LabelEmbeddingFactory, Indexer
from pecos.xmc.xlinear.model import XLinearModel

# Build PIFA label embeddings from the combined features, cluster the labels,
# and train the XR-Linear model on the same feature matrix.
label_feat = LabelEmbeddingFactory.create(Y, X, method="pifa")
cluster_chain = Indexer.gen(label_feat, nr_splits=4)
xlinear_model = XLinearModel.train(X, Y, C=cluster_chain, negative_sampling_scheme="tfn")
```
Prediction:
```python
from pecos.utils import smat_util

def process_query_and_predict(query, use_cpu_threads):
    # `query` is expected to be a list of strings, e.g. ["some search query"].
    tfidf_vector = preprocessor.predict(query)   # sparse TF-IDF features
    bge_embedding = model.encode(query)          # dense BGE embeddings
    pred_X = scipy.sparse.csr_matrix(scipy.sparse.hstack((tfidf_vector, bge_embedding)))
    Y_pred = xlinear_model.predict(pred_X)
    return smat_util.sorted_csr(Y_pred)
```
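I call it like this (the query string below is just a placeholder):

```python
# Placeholder query; note the input is wrapped in a list.
topk_preds = process_query_and_predict(["wireless headphones"], use_cpu_threads=True)
```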
But, surprisingly, this model performs quite poorly compared to the model trained on standalone TF-IDF features.
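For context, this is roughly how I compare the two models on a held-out split (`tst_X` / `tst_Y` here stand in for my test feature and label matrices, built the same way as `X` and `Y` above):

```python
from pecos.utils import smat_util

# Predict on the held-out split and report precision/recall at k.
Y_pred = xlinear_model.predict(tst_X, beam_size=10, only_topk=10)
print(smat_util.Metrics.generate(tst_Y, Y_pred, topk=10))
```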
I saw the tutorial for PECOS XR-Linear and tried to replicate the process it carries out; instead of AttnXML, I want to introduce semantic capabilities via the BGE model.
Can someone share some insights on where I could be going wrong, and on how I can merge the TF-IDF features with the BGE embeddings?
Thanks