VinAIResearch/PhoBERT

Semantic search with Sentence Transformers

icesonata opened this issue · 2 comments

Currently, I'm a student working on a project involving semantic search over a Vietnamese corpus. My goal is to build a model that can process a user's query and return the sentences or documents most relevant to that query. My strategy is to use PhoBERT with sentence-transformers to embed sentences and documents into vectors, then store those vectors in a FAISS index for querying.
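The embed-then-index pipeline described above can be sketched as follows. Note that `embed` here is a hypothetical stand-in for a real encoder (e.g. PhoBERT through sentence-transformers); it produces deterministic pseudo-embeddings so the indexing and search mechanics are runnable without the heavyweight models, and plain NumPy cosine search stands in for the FAISS index:

```python
import hashlib
import numpy as np

# Hypothetical `embed` stands in for a real sentence encoder
# (e.g. PhoBERT via sentence-transformers); it produces deterministic
# pseudo-embeddings keyed on the text so the example is self-contained.
def embed(texts, dim=8):
    vecs = []
    for t in texts:
        seed = int(hashlib.md5(t.encode("utf-8")).hexdigest()[:8], 16)
        vecs.append(np.random.default_rng(seed).standard_normal(dim))
    return np.stack(vecs)

def build_index(doc_vecs):
    # L2-normalize so an inner product equals cosine similarity --
    # the same trick used with faiss.IndexFlatIP on normalized vectors.
    return doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)

def search(index, query_vec, k=3):
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q                 # cosine similarity per document
    top = np.argsort(-scores)[:k]      # indices of the k best matches
    return top, scores[top]

docs = ["câu thứ nhất", "câu thứ hai", "câu thứ ba"]
index = build_index(embed(docs))
ids, scores = search(index, embed(["câu thứ nhất"])[0], k=2)
```

With a real encoder, `embed` would be replaced by the model's encode call and the normalized vectors would be added to a FAISS index instead of a NumPy matrix.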

I conducted the semantic search experiment described above with both PhoBERT base and large, but it produced very poor results on the Vietnamese corpus while performing fairly well on an English corpus. After investigating, I realized the problem lay in the quality of the encodings. Hence, I have a couple of questions:

  1. Have you ever tried PhoBERT with sentence-transformers, or any similar embedding technique with PhoBERT, on a Vietnamese corpus for semantic search? What was your approach and how were the outcomes?
  2. Does sentence embedding with PhoBERT on Vietnamese text require any preprocessing beyond word segmentation?
  3. Is it effective to embed a whole Vietnamese document (~200 words) with sentence-transformers and PhoBERT?

Additionally, I preprocessed the Vietnamese corpus using only word segmentation before passing it to the embedding step.

I am new to NLP, so I would really appreciate any advice on this project.

How are you incorporating PhoBERT with sentence-transformers? Why not first try the pre-trained multilingual models available in sentence-transformers?

You were right. Back then, I misunderstood and thought I could load any model into sentence-transformers, so I blindly did so without checking. I switched to the multilingual models available in sentence-transformers and the performance improved.

Thanks, I'm closing the issue.