/weighted-TFIDF-with-BERT

Research Project for CS510 Advanced Information Retrieval at UIUC


Improved Semantic Search based on Weighted TF-IDF & BERT

Abstract

We present an improved semantic search approach based on a weighted TF-IDF method and the BERT natural language model. We motivate the weighted TF-IDF method with the intuition that the questionable spans in a document summarize its topics and should therefore be given greater emphasis when computing the TF-IDF score. We use the BERT language model to compensate for the TF-IDF framework's weakness at capturing the true semantic meaning of a document. Our model thus encodes both a document's questionable spans and its true semantics, and it scales effectively with dataset size. In a number of semantic search experiments on question-answering datasets, we demonstrate that our approach outperforms traditional models by a significant margin.
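The scoring idea described above can be sketched in Python. This is an illustrative sketch, not the project's actual implementation: the `span_weight` boost for questionable spans and the linear interpolation with a BERT cosine similarity (via `alpha`) are assumptions made for clarity.

```python
import math

def weighted_tf_idf(term, doc_terms, span_terms, corpus, span_weight=2.0):
    """Illustrative weighted TF-IDF: occurrences of `term` inside a
    document's questionable spans are up-weighted by `span_weight`
    (an assumed tuning parameter, not taken from the paper)."""
    # Boosted term frequency: each span occurrence counts span_weight times.
    tf = doc_terms.count(term) + (span_weight - 1.0) * span_terms.count(term)
    # Smoothed inverse document frequency over the corpus.
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0
    return tf * idf

def combined_score(tfidf_score, bert_cosine, alpha=0.5):
    """Assumed linear interpolation of the lexical (weighted TF-IDF) score
    and the cosine similarity of BERT embeddings; `alpha` is a hypothetical
    mixing weight."""
    return alpha * tfidf_score + (1.0 - alpha) * bert_cosine
```

A term appearing in a questionable span thus contributes more to the document's score than the same term elsewhere, and the BERT similarity term lets semantically related documents rank well even with little lexical overlap.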

Running the Project

  1. Place the dataset in the data/ folder;
  2. Import data/cs510project_new_words.sql and data/cs510project_new_sentences.sql into a MySQL database;
  3. Update the MySQL username and password in tfidf_search.py;
  4. Start query_type_server.py;
  5. Start bert-as-service on localhost (see hanxiao/bert-as-service for details). We use the BERT-Base Uncased model (12-layer, 768-hidden, 12-heads, 110M parameters) for BERT vectors;
  6. Run main.py and start searching.
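The steps above might look roughly like the following shell session. This is a hedged sketch, not a verified script: the MySQL database name (`cs510project`) and the BERT model directory path are assumptions, and your credentials and paths will differ.

```shell
# Steps 2-3: load the SQL dumps into a MySQL database
# (database name "cs510project" is illustrative).
mysql -u <user> -p cs510project < data/cs510project_new_words.sql
mysql -u <user> -p cs510project < data/cs510project_new_sentences.sql

# Step 4: start the query-type server in the background.
python query_type_server.py &

# Step 5: start bert-as-service with the BERT-Base Uncased checkpoint
# (the model directory name below is the standard download folder,
# assumed here; adjust to where you extracted the model).
bert-serving-start -model_dir ./uncased_L-12_H-768_A-12 -num_worker=1 &

# Step 6: run the search interface.
python main.py
```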

About the course (Fall 2018)