Document_Retrieval

Proof of concept for retrieving documents that belong to a domain. About 300 documents are converted from ppt to text, and document retrieval is then performed on the text using traditional and semantic search methods. This project is a work in progress.

Content

[Setup Instructions]
[Traditional Method for Document Retrieval]
- [Approach]
- [Disadvantages]
[Semantic Search]
- [Approach]
- [Expanding the Query]
- [Advantages of semantic search over traditional keyword search]
[Usecases]
[Further Reading]
[References]

Setup Instructions

Run the commands below to download the repo and install dependencies such as TensorFlow.

git clone https://github.com/vamshi-indla/Document_Retrieval.git
cd Document_Retrieval
./env_setup.sh

Traditional Method for Document Retrieval

Approach

Convert PDFs to text (doc_to_text.py)
- Preprocess the data
Index the documents using an inverted index
Term weighting: The importance of each term within a document is calculated using term frequency.
Similarity coefficients: Documents and queries are represented as vectors of term weights.
Retrieval: Documents are ranked by cosine similarity to the query.
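
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in this repo: the function names (`tokenize`, `build_index`, `search`) and the three toy documents are assumptions made for the example, and the weights are raw term frequencies only.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Return (inverted index, per-document term-frequency vectors)."""
    index = defaultdict(set)   # term -> ids of documents containing it
    vectors = {}               # doc id -> Counter of term frequencies
    for doc_id, text in docs.items():
        tf = Counter(tokenize(text))
        vectors[doc_id] = tf
        for term in tf:
            index[term].add(doc_id)
    return index, vectors

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, index, vectors):
    """Rank the documents that share at least one term with the query."""
    q = Counter(tokenize(query))
    candidates = set()
    for term in q:
        candidates |= index.get(term, set())
    scored = [(doc_id, cosine(q, vectors[doc_id])) for doc_id in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy corpus, invented for this example.
docs = {
    "d1": "television repair manual",
    "d2": "apple pie recipe with fresh apple slices",
    "d3": "tv buying guide",
}
index, vectors = build_index(docs)
results = search("apple recipe", index, vectors)
```

The inverted index is used only to pick candidate documents, so the cosine score is computed just for documents that contain at least one query term.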

Disadvantages

Term mismatch is the most concerning problem for effective information retrieval. It takes several forms:

Vocabulary problem: The words used to index the documents are not the same as the words in the user query.
Synonymy: Different words with the same meaning (Ex: “television” and “tv”). Synonymy may result in a failure to retrieve relevant documents, which decreases recall.
Polysemy: The same word with different meanings (Ex: “apple” as company vs. fruit). Polysemy may cause retrieval of erroneous or irrelevant documents, which decreases precision.
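Hypernymy and Hyponymy: Broader and narrower terms (Ex: “fruit” is a hypernym of “apple”, and “apple” is a hyponym of “fruit”). A query phrased at one level of generality may miss documents indexed at the other.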
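
A small sketch of how term mismatch shows up with exact keyword matching. The documents and the `keyword_match` helper are hypothetical and exist only to illustrate the recall and precision effects described above.

```python
# Toy corpus, invented for this example.
docs = {
    "d1": "tv buying guide",
    "d2": "television repair manual",
    "d3": "apple quarterly earnings report",
    "d4": "apple pie recipe",
}

def keyword_match(query, docs):
    """Return ids of documents sharing at least one exact token with the query."""
    terms = set(query.lower().split())
    return sorted(doc_id for doc_id, text in docs.items()
                  if terms & set(text.lower().split()))

# Different words, same meaning: the "tv" document d1 is missed (lower recall).
synonym_miss = keyword_match("television", docs)

# Same word, different meanings: company and fruit documents both match
# (lower precision for a user who wants only one of them).
polysemy_noise = keyword_match("apple", docs)
```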

Semantic Search

Approach

Preprocessing data
Finding Synonyms, Polysemy, Hypernymy and Hyponymy
Creating topics
Create Domain specific Word Embeddings OR Use Transfer Learning
- Word2vec (2013) and SVD?
- CBOW
- Skip-gram with negative sampling
- GloVe (2014)
- SVD, similar to PCA
Preprocess Query
Expand the Query
Find the nearest document to Query
- Cosine
- Word Mover's Distance
- Euclidean distance
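
The last step, finding the nearest document to the query, might look like the sketch below. It assumes each text is represented by the average of its word vectors. The 3-dimensional `EMBEDDINGS` table is hand-written for illustration and does not come from Word2vec or GloVe; a real run would load trained vectors instead. Word Mover's Distance is not shown.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-written for illustration only.
EMBEDDINGS = {
    "tv":         [0.90, 0.10, 0.00],
    "television": [0.88, 0.12, 0.02],
    "guide":      [0.20, 0.70, 0.10],
    "apple":      [0.05, 0.10, 0.90],
    "pie":        [0.00, 0.30, 0.80],
}

def embed(text):
    """Average the vectors of known words; return None if no word is known."""
    vecs = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity between two dense, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean(a, b):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, docs):
    """Return the id of the document most similar to the query by cosine.

    Assumes the query and every document contain at least one known word.
    """
    q = embed(query)
    return max(docs, key=lambda doc_id: cosine(q, embed(docs[doc_id])))

docs = {"d1": "tv guide", "d2": "apple pie"}
best = nearest("television", docs)
```

With these toy vectors, the query "television" lands on the "tv guide" document even though the two share no token, which is the behavior exact keyword matching cannot provide.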

Expanding the Query

User inputs a query in natural language.
Use tools like StanfordParser to identify the noun phrases and other grammatical structure in the query.
Related synonym sets for the words in the query are obtained from an ontology and the WordNet API.
Add these words to the original query to form the new query.
The resulting query is more refined and is sent to the Search API, which fetches the results related to the user query. The following diagram depicts the flow: [image]

Example run

User Query: name of football clubs in EEFA.
Parsed words for this user query using Stanford Parser:
WordNet and ontology synonym words: list, soccer
Expanded Query: Name or list the football soccer clubs in EEFA
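
The expansion step in this example run can be approximated with a lookup table. In the sketch below, `SYNONYMS` is a hypothetical stand-in for the WordNet and ontology lookups, and the parsing step is skipped entirely, so the output is a plain word list and not the more natural phrasing shown above.

```python
# Hypothetical synonym table standing in for a WordNet or ontology lookup.
SYNONYMS = {
    "name": ["list"],
    "football": ["soccer"],
}

def expand_query(query, synonyms):
    """Append synonyms after each matching query word, preserving word order."""
    expanded = []
    for word in query.split():
        expanded.append(word)
        for syn in synonyms.get(word.lower(), []):
            if syn not in expanded:
                expanded.append(syn)
    return " ".join(expanded)

expanded = expand_query("name of football clubs in EEFA", SYNONYMS)
```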

Advantages of semantic search over traditional keyword search

Traditional keyword search cannot tell the difference between “USA players in Catalan basket team” and “Catalan players in USA teams”. Such cases are not a problem for semantic search.
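
To see why keyword search struggles here, consider a bag-of-words view of two simplified variants of these queries (the wording is shortened for the example so that both use exactly the same words). The bigram comparison is only one simple, assumed way to make a representation order-aware; it is not the method this project uses.

```python
from collections import Counter

def bag_of_words(text):
    """Count tokens, ignoring their order."""
    return Counter(text.lower().split())

def bigrams(text):
    """Return adjacent token pairs, which keep local word order."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

q1 = "USA players in Catalan teams"
q2 = "Catalan players in USA teams"

# Identical bags: a pure keyword model scores both queries the same.
same_bag = bag_of_words(q1) == bag_of_words(q2)

# Different bigrams: an order-aware representation can tell them apart.
same_bigrams = bigrams(q1) == bigrams(q2)
```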

Usecases

Analogies
Predicting next word, using sequence modeling
Fill in the blanks :)
Sentiment Analysis
- Word embeddings work best when there are only a few labeled training examples
- Using the average or sum of all the embeddings can work. However, it fails for sarcasm examples. Ex: lacking good taste.
- Use RNNs or LSTMs in that scenario
Machine Translation and Captioning an image
- Greedy Search
- Beam Search
Speech Recognition
Topic Modeling

References

https://github.com/eBay/Sequence-Semantic-Embedding
https://spoddutur.github.io/my-notes/semantic-search-2.html
https://opensourceconnections.com/blog/2013/08/25/semantic-search-with-solr-and-python-numpy/ (collaborative filtering search)
