/weighted-TFIDF-with-BERT

Research Project for CS510 Advanced Information Retrieval at UIUC


Improved Semantic Search based on Weighted TF-IDF & BERT

Abstract

We present an improved semantic search approach based on a weighted TF-IDF method and the BERT natural language model. We motivate the weighted TF-IDF method with the intuition that the questionable spans in a document summarize its topics and should therefore be given greater emphasis when computing the TF-IDF score. We use the BERT language model to compensate for the TF-IDF framework's weakness at capturing the true semantic meaning of a document. Our model thus encodes both a document's questionable spans and its true semantics, and it scales effectively with dataset size. In a number of semantic search experiments on question-answering datasets, we demonstrate that our approach outperforms traditional models by a significant margin.
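The scoring idea described above can be sketched in Python. This is an illustrative sketch, not the project's actual implementation: the `span_weight` boost for questionable spans and the linear interpolation with a BERT cosine similarity (via `alpha`) are assumptions made for clarity.

```python
import math

def weighted_tf_idf(term, doc_terms, span_terms, corpus, span_weight=2.0):
    """Illustrative weighted TF-IDF: occurrences of `term` inside a
    document's questionable spans are up-weighted by `span_weight`
    (an assumed tuning parameter, not taken from the paper)."""
    # Boosted term frequency: each span occurrence counts span_weight times.
    tf = doc_terms.count(term) + (span_weight - 1.0) * span_terms.count(term)
    # Smoothed inverse document frequency over the corpus.
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0
    return tf * idf

def combined_score(tfidf_score, bert_cosine, alpha=0.5):
    """Assumed linear interpolation of the lexical (weighted TF-IDF) score
    and the cosine similarity of BERT embeddings; `alpha` is a hypothetical
    mixing weight."""
    return alpha * tfidf_score + (1.0 - alpha) * bert_cosine
```

A term appearing in a questionable span thus contributes more to the document's score than the same term elsewhere, and the BERT similarity term lets semantically related documents rank well even with little lexical overlap.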

Running the Project

  1. Place the dataset in the data/ folder;
  2. Import data/cs510project_new_words.sql and data/cs510project_new_sentences.sql into a MySQL database;
  3. Update the MySQL username and password in tfidf_search.py;
  4. Start query_type_server.py;
  5. Start bert-as-service on localhost (see hanxiao/bert-as-service for details). We use the BERT-Base Uncased model (12-layer, 768-hidden, 12-heads, 110M parameters) for BERT vectors;
  6. Run main.py and start searching.
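The steps above might look roughly like the following shell session. This is a hedged sketch, not a verified script: the MySQL database name (`cs510project`) and the BERT model directory path are assumptions, and your credentials and paths will differ.

```shell
# Steps 2-3: load the SQL dumps into a MySQL database
# (database name "cs510project" is illustrative).
mysql -u <user> -p cs510project < data/cs510project_new_words.sql
mysql -u <user> -p cs510project < data/cs510project_new_sentences.sql

# Step 4: start the query-type server in the background.
python query_type_server.py &

# Step 5: start bert-as-service with the BERT-Base Uncased checkpoint
# (the model directory name below is the standard download folder,
# assumed here; adjust to where you extracted the model).
bert-serving-start -model_dir ./uncased_L-12_H-768_A-12 -num_worker=1 &

# Step 6: run the search interface.
python main.py
```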

About the course (Fall 2018)