/info_retrieve

Framework for an information retrieval engine (QnA, knowledge base query, etc.)


GoldenRetriever - Information retrieval using fine-tuned semantic similarity

GoldenRetriever is part of the HotDoc NLP project, which provides a series of open-source AI tools for natural language processing. HotDoc NLP is part of the AI Makerspace program. Please visit the demo page where you will be able to query a sample knowledge base.

GoldenRetriever is a framework for an information retrieval engine (QnA, knowledge base query, etc.) that works in four steps, sketched in the code example after this list:

  • Step 1: The knowledge base has to be separated into "documents" or clauses. Each clause is an indexed unit of information, e.g. a contract clause, a sentence, or a paragraph.
  • Step 2: The clauses (and the query) should be encoded with the same encoder (InferSent, Google USE1, or Google USE-QA2).
  • Step 3: A similarity score is calculated (cosine distance, arccos distance, dot product, or nearest neighbors).
  • Step 4: Clauses with the highest score (or nearest neighbors) are returned as the retrieved document.
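
A minimal sketch of this four-step flow, using the public Universal Sentence Encoder module from TF Hub and plain cosine similarity. The clause texts and the hub URL below are illustrative assumptions, not part of this repository:

```python
# Requires tensorflow and tensorflow_hub; the hub URL is the public USE v4 module.
import numpy as np
import tensorflow_hub as hub

encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Step 1: the knowledge base, split into indexed clauses
clauses = [
    "The policy covers accidental damage to the insured vehicle.",
    "Claims must be filed within 30 days of the incident.",
    "The deductible is waived for windshield repairs.",
]

# Step 2: encode clauses and query with the same encoder
clause_vecs = np.asarray(encoder(clauses))
query_vec = np.asarray(encoder(["How long do I have to submit a claim?"]))

# Step 3: cosine similarity between the query and every clause
clause_vecs = clause_vecs / np.linalg.norm(clause_vecs, axis=1, keepdims=True)
query_vec = query_vec / np.linalg.norm(query_vec)
scores = clause_vecs @ query_vec.ravel()

# Step 4: return the k highest-scoring clauses
k = 2
top_k = np.argsort(scores)[::-1][:k]
print([clauses[i] for i in top_k])
```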

model_finetuning.py currently optimizes the framework for retrieving clauses from a contract or a set of terms and conditions, given a natural language query.

There is potential for fine-tuning following Yang et al.'s (2018) paper on learning textual similarity from conversations.

A fully connected layer is inserted after the clauses are encoded to maximize the dot product between the transformed clauses and the encoded query.
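
As a rough sketch of this idea (not the exact code in model_finetuning.py), a trainable dense layer over the encoded clauses can be optimized with in-batch negatives so that the dot product for matching query-clause pairs is maximized:

```python
import tensorflow as tf

EMBED_DIM = 512  # assumption: USE-style 512-dimensional sentence embeddings

transform = tf.keras.layers.Dense(EMBED_DIM)   # fully connected layer applied to encoded clauses
optimizer = tf.keras.optimizers.Adam(1e-4)

def train_step(query_vecs, clause_vecs):
    """query_vecs[i] matches clause_vecs[i]; the other rows act as in-batch negatives."""
    with tf.GradientTape() as tape:
        transformed = transform(clause_vecs)                            # (batch, dim)
        logits = tf.matmul(query_vecs, transformed, transpose_b=True)   # (batch, batch) dot products
        labels = tf.range(tf.shape(logits)[0])                          # diagonal entries are the correct pairs
        loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
    grads = tape.gradient(loss, transform.trainable_variables)
    optimizer.apply_gradients(zip(grads, transform.trainable_variables))
    return loss
```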

In the transfer learning use case, the Google USE-QA model is further fine-tuned using a triplet cosine loss function. This pushes correct question-knowledge pairs closer together while enforcing an angular margin for question-wrong-knowledge pairs. This method can be used to overfit to any fixed FAQ dataset without losing the semantic similarity capabilities of the sentence encoder.
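
A minimal sketch of such a triplet cosine loss, with a hypothetical margin value (the repository's exact formulation may differ):

```python
import tensorflow as tf

def triplet_cosine_loss(query, pos_clause, neg_clause, margin=0.3):
    """Pull the correct (query, clause) pair together while keeping at least
    `margin` of cosine separation from the wrong clause."""
    q = tf.math.l2_normalize(query, axis=-1)
    p = tf.math.l2_normalize(pos_clause, axis=-1)
    n = tf.math.l2_normalize(neg_clause, axis=-1)
    pos_sim = tf.reduce_sum(q * p, axis=-1)   # cosine similarity with the correct clause
    neg_sim = tf.reduce_sum(q * n, axis=-1)   # cosine similarity with the wrong clause
    return tf.reduce_mean(tf.maximum(0.0, margin - pos_sim + neg_sim))
```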

Deployment

The model is served as a Flask app.

Run python app.py to launch a web interface from which you can query some pre-set documents.

To run the Flask API using Docker:

  1. Clone this repository.
  2. Build the container image: docker build -f api.Dockerfile -t goldenretriever .
  3. Run the container: docker run -p 5000:5000 -it goldenretriever
  4. Access the endpoints at http://localhost:5000 (see the example request below).
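
For illustration, a request against the running container might look like the following; the route name and payload fields are assumptions and may differ from those defined in app.py:

```python
import requests

response = requests.post(
    "http://localhost:5000/query",   # hypothetical route; check app.py for the actual endpoint
    json={"query": "Can I terminate my contract early?", "top_k": 3},  # hypothetical payload fields
)
print(response.json())
```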

Alternatively, to run the Streamlit app using Docker:

  1. Clone this repository.
  2. Build the container image: docker build -f streamlit.Dockerfile -t goldenretriever .
  3. Run the container: docker run -p 5000:5000 goldenretriever
  4. Access the web interface on your browser by navigating to http://localhost:5000.

Testing

For comparison, we apply three sentence encoding models to the InsuranceQA corpus. Each test case consists of a question and 100 candidate answers, one or more of which are correct.

The evaluation metric is accuracy@k, where k is the number of clauses the model returns for a given query. A score of 1 indicates that the k returned clauses contain a correct answer to the query; a score of 0 indicates that none of them do.

Model            acc@1    acc@2    acc@3    acc@4    acc@5
InferSent        0.083    0.134    0.1814   0.226    0.268
Google USE       0.251    0.346    0.427    0.481    0.534
Google USE-QA    0.387    0.519    0.590    0.648    0.698
TFIDF baseline   0.2457   0.3492   0.4127   0.4611   0.4989
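
For reference, accuracy@k for a single test case can be computed with a sketch like this, assuming scores holds the question's similarity to each of the 100 candidates and correct holds the indices of the correct answers:

```python
import numpy as np

def accuracy_at_k(scores, correct, k):
    """Return 1.0 if any of the top-k ranked candidates is a correct answer, else 0.0."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(any(i in correct for i in top_k))

# Toy example: candidate 7 is correct and ranks first, so acc@5 is 1.0
scores = np.random.rand(100) * 0.5
scores[7] = 0.99
print(accuracy_at_k(scores, correct={7}, k=5))
```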

Footnotes

  • 1 Google Universal Sentence Encoder
  • 2 Google Universal Sentence Encoder for Question-Answer Retrieval

Acknowledgements

This project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG-RP-2019-050). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.