- Group members: Yuxiang Li, Yuxin Xiao, Zhen Fan
- Research report: Improved Semantic Search based on Weighted TF-IDF & BERT
We present an improved semantic search approach based on a weighted TF-IDF method and the BERT language model. The weighted TF-IDF method is motivated by the intuition that the questionable spans in a document summarize the document's topics and hence should be given greater emphasis when calculating the TF-IDF score. BERT complements the TF-IDF framework, which is weak at capturing the true semantic meaning of a document. Our model therefore encodes both a document's questionable spans and its semantics, and it scales effectively with the size of the dataset. In a number of semantic search experiments on question-answering datasets, we demonstrate that our approach outperforms traditional models by a significant margin.
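The weighting idea can be illustrated with a minimal sketch. It is not the project's implementation: the `boost` factor, the smoothed IDF formula, the tokenizer, and the sample documents are all assumptions made for illustration, and the spans marked for emphasis are supplied by hand here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weighted_tf(tokens, emphasized, boost=2.0):
    """Term frequencies where tokens from emphasized spans count `boost` times."""
    tf = Counter()
    for tok in tokens:
        tf[tok] += boost if tok in emphasized else 1.0
    return tf


def idf_table(docs_tokens):
    """Smoothed inverse document frequency (an assumed formula)."""
    n = len(docs_tokens)
    df = Counter()
    for toks in docs_tokens:
        df.update(set(toks))
    return {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}


def rank(query, docs, boost=2.0):
    """Rank (text, spans) pairs by weighted TF-IDF score against the query.

    Returns a list of (document index, score), best match first.
    """
    docs_tokens = [tokenize(text) for text, _ in docs]
    idf = idf_table(docs_tokens)
    q_terms = set(tokenize(query))
    scores = []
    for i, (toks, (_, spans)) in enumerate(zip(docs_tokens, docs)):
        emphasized = {t for span in spans for t in tokenize(span)}
        tf = weighted_tf(toks, emphasized, boost)
        scores.append((i, sum(tf[t] * idf.get(t, 0.0) for t in q_terms)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical documents, each paired with hand-picked emphasized spans.
DOCS = [
    ("The Eiffel Tower was completed in 1889 in Paris.", ["Eiffel Tower", "1889"]),
    ("Paris hosts many museums, and the tower is visible from several of them.", ["museums"]),
]

if __name__ == "__main__":
    for index, score in rank("eiffel tower", DOCS):
        print(index, round(score, 3))
```

Setting `boost=1.0` reduces the sketch to an unweighted TF-IDF score, which makes the effect of the emphasis easy to compare.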
- Get the dataset into the `data/` folder;
- Import `data/cs510project_new_words.sql` and `data/cs510project_new_sentences.sql` in a MySQL database;
- Change the MySQL username and passwords in `tfidf_search.py`;
- Start `query_type_server.py`;
- Start bert-as-service on localhost. See hanxiao/bert-as-service for details. We use the BERT-Base Uncased model (12-layer, 768-hidden, 12-heads, 110M parameters) for BERT vectors;
- Run `main.py` and start searching.
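To check that the bert-as-service server from the steps above is reachable, a small script along these lines may help. It is a sketch under assumptions: `rank_by_embedding` and `bert_encoder` are hypothetical helper names, and ranking by cosine similarity is an illustrative choice rather than a description of what `main.py` does. Only `BertClient` and its `encode` method come from the bert-as-service client package.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_by_embedding(query, sentences, encode):
    """Rank sentences by cosine similarity to the query.

    `encode` is any callable that maps a list of strings to a sequence of vectors.
    Returns (similarity, sentence) pairs, most similar first.
    """
    vectors = encode([query] + list(sentences))
    q_vec, s_vecs = vectors[0], vectors[1:]
    scored = [(cosine(q_vec, vec), sent) for vec, sent in zip(s_vecs, sentences)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def bert_encoder():
    """Return an encoder backed by a bert-as-service server on localhost.

    Assumes the bert-serving-client package is installed and the server is running.
    """
    from bert_serving.client import BertClient  # imported lazily on purpose

    return BertClient().encode


if __name__ == "__main__":
    sentences = ["The Eiffel Tower is in Paris.", "MySQL is a relational database."]
    for score, sentence in rank_by_embedding("Where is the Eiffel Tower?", sentences, bert_encoder()):
        print(round(score, 3), sentence)
```

Passing the encoder in as an argument keeps the ranking logic testable without a running server.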
- Course page
- Instructor: ChengXiang Zhai
- Location: 0216 Siebel Center (SC), UIUC