DaoD/INTERS

Embedding vs parametric knowledge

ashokrajab opened this issue · 5 comments

This work focuses on fine-tuning LLMs to make them perform better at IR tasks using only their parametric knowledge.

Usually, however, embeddings are generated for the query and the documents, and the top relevant documents are found with a nearest-neighbour algorithm.

I would like to know your thought process behind this: why did you go with parametric knowledge as opposed to embedding generation and approximate nearest-neighbour search?
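For concreteness, the kind of embedding-based pipeline I mean looks roughly like the sketch below (the encoder name and documents are just placeholders, and a real system would normally use an approximate nearest-neighbour index such as FAISS instead of brute-force similarity search):

```python
# Illustrative sketch of embedding-based retrieval: encode the query and the
# documents, then take the nearest neighbours by cosine similarity.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # any dense encoder works here

documents = ["first candidate document ...",
             "second candidate document ...",
             "third candidate document ..."]
doc_emb = model.encode(documents, normalize_embeddings=True)
query_emb = model.encode(["example query"], normalize_embeddings=True)

# With L2-normalized embeddings, the dot product equals cosine similarity.
scores = (doc_emb @ query_emb.T).squeeze()
top_k = np.argsort(-scores)[:2]  # indices of the top-2 documents
print([documents[i] for i in top_k])
```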

Thanks for your interest in our work. I agree with you that embedding-based methods are commonly used in retrieval algorithms. However, information retrieval encompasses more than just the retrieval process. Our study is centered on the broader capabilities of LLMs in executing various IR tasks, including a range of generative tasks, as demonstrated in our paper (for instance, in query reformulation).

Our work can improve the performance of the reranking task, which shares some similarities with the retrieval task. However, as we stated in the paper, we have not explored applying our model to the retrieval task (such as dense retrieval, as you mentioned). This is primarily because the training methodology for retrievers differs substantially from the instruction tuning approach we currently employ. It would be interesting to investigate whether our method can also improve the model's performance on dense retrieval. We will check this once our ongoing projects are completed.

Thank you, and we welcome any further discussions on this topic.

Thank you for the quick reply.

One more clarification needed.
I would like to know how you calculated MRR@10 and nDCG@10 on the MSMARCO data for INTERS-LLaMA-Task Description.
MRR and nDCG involve comparing document ranks, whereas INTERS-LLaMA-Task generates document content directly, so I cannot understand how these metrics are calculated.

Also, how are the instructions handled at inference time for the fine-tuned model?

During training, we know which dataset each example comes from and which instruction was generated for it, which aids better answer generation.
But during inference we will not have this information, so how is this handled?

Q1: How to compute MRR and nDCG.
A1: Currently, there are three kinds of methods (pointwise, pairwise, and listwise) for applying text generation to document ranking. You can refer to Section 5.2 of this survey paper (https://arxiv.org/pdf/2308.07107.pdf) for more details. In INTERS, we use the pointwise method. We will highlight this in our paper.
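As a concrete illustration (a minimal sketch, not the INTERS evaluation code; the prompt template and the `llm_yes_logprob` scoring callable are assumptions), a pointwise reranker scores each (query, document) pair independently with the LLM, sorts the candidates by that score, and MRR@10 / nDCG@10 are then computed from the resulting ranking against the relevance judgments:

```python
# Minimal sketch of pointwise reranking with a generative LLM and the metrics
# computed from the induced ranking. Prompt wording and the scoring callable
# are illustrative assumptions, not the INTERS implementation.
import math

def pointwise_scores(llm_yes_logprob, query, documents):
    """Score each document independently: prompt the LLM with (query, doc)
    and use the log-probability of the 'yes' (relevant) answer as the score."""
    prompts = [
        f"Query: {query}\nDocument: {doc}\n"
        "Is the document relevant to the query? Answer yes or no."
        for doc in documents
    ]
    return [llm_yes_logprob(p) for p in prompts]  # higher = more relevant

def mrr_at_10(ranked_doc_ids, relevant_ids):
    """Reciprocal rank of the first relevant document within the top 10."""
    for rank, doc_id in enumerate(ranked_doc_ids[:10], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_10(ranked_doc_ids, relevance):
    """relevance: dict mapping doc_id -> graded relevance label."""
    dcg = sum(
        (2 ** relevance.get(doc_id, 0) - 1) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_doc_ids[:10], start=1)
    )
    ideal = sorted(relevance.values(), reverse=True)[:10]
    idcg = sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(ideal, start=1)
    )
    return dcg / idcg if idcg > 0 else 0.0

# Usage: rank the candidate documents by their pointwise scores, then evaluate.
# scores = pointwise_scores(my_llm_yes_logprob, query, candidate_docs)
# ranked = [d for _, d in sorted(zip(scores, candidate_ids), reverse=True)]
# print(mrr_at_10(ranked, relevant_ids), ndcg_at_10(ranked, graded_relevance))
```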

Q2: About the task generalization.
A2: In Section 4.4 of the INTERS paper, we consider the out-of-domain evaluation scenario. In the task-level generalization evaluation, we remove some tasks from the full training set and test them in a zero-shot manner. We find that even if some tasks are not trained on, our fine-tuned model can still perform them well. For these unseen tasks, the dataset and instructions fall into the "we do not have this information" case you mentioned. For other tasks that are not included in INTERS, we believe they can be performed by using any reasonable instructions. This is also the motivation of our paper: making LLMs understand IR tasks better.
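For example, at inference time a task, seen or unseen, is specified only through the instruction in the prompt; the snippet below is a hypothetical illustration of such a prompt, not one of the actual INTERS templates:

```python
# Hypothetical example: a hand-written instruction describing the task is placed
# directly in the prompt at inference time. Any clear, reasonable description of
# the task should work; this wording is not from the INTERS templates.
query = "how long does it take to walk 5 km"
prompt = (
    "You are given a web search query. Rewrite the query so that it is clearer "
    "and more specific, without changing its meaning.\n"
    f"Query: {query}\n"
    "Rewritten query:"
)
# The fine-tuned model is then prompted with `prompt` in the usual way, e.g.:
# output = generate(model, tokenizer, prompt)
```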

Thank you for your response.