Proof of concept for retrieving documents that belong to a domain. About 300 documents are converted from ppt to text, and document retrieval is then performed on the text using traditional and semantic search methods. This project is a work in progress.
- [Setup Instructions]
- [Traditional Method for Document Retrieval]
- [Approach]
- [Disadvantages]
- [Semantic Search]
- [Approach]
- [Expanding the Query]
- [Advantages of semantic search over traditional keyword search]
- [Usecases]
- [Further Reading]
- [References]
Run the commands below to clone the repo and install dependencies such as TensorFlow.
git clone https://github.com/vamshi-indla/Document_Retrieval.git
cd Document_Retrieval
./env_setup.sh
- Convert PDFs to text (doc_to_text.py)
  - Preprocess the data
- Index the documents using an inverted index
- Term weighting: The importance of each term within a document is calculated from its term frequency.
- Similarity coefficients: Documents and queries are represented by vectors of term weights.
- Retrieval: Retrieval is done by cosine similarity.
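The steps above can be sketched in a few lines of plain Python. This is only an illustration, not the code in this repo: the function names, the whitespace tokenizer, and the three sample documents are all assumptions, and raw term frequency is used as the term weight.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and split on whitespace; real preprocessing would do more.
    return text.lower().split()

def build_index(docs):
    """Return (inverted index, term-frequency vector per document)."""
    index = defaultdict(set)
    vectors = {}
    for doc_id, text in docs.items():
        tf = Counter(tokenize(text))
        vectors[doc_id] = tf
        for term in tf:
            index[term].add(doc_id)
    return index, vectors

def cosine(a, b):
    # Cosine similarity between two sparse term-weight vectors.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, index, vectors):
    q = Counter(tokenize(query))
    # Only documents sharing at least one term with the query are scored.
    candidates = set()
    for term in q:
        candidates |= index.get(term, set())
    scored = [(doc_id, cosine(q, vectors[doc_id])) for doc_id in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical sample documents.
docs = {
    "d1": "quarterly sales report for the television division",
    "d2": "tv sales forecast",
    "d3": "employee onboarding guide",
}
index, vectors = build_index(docs)
results = search("television sales", index, vectors)
```

In this toy run `d1` ranks above `d2`, and `d2` is matched only through the word "sales": the query term "television" never matches the document term "tv".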
Term mismatch is the most serious problem for effective information retrieval. It takes several forms:
- Vocabulary problem: The words under which the documents are indexed are not the same as the words in the user query.
- Synonymy: Different words with the same meaning (Ex: “television” and “tv”). Synonymy may result in a failure to retrieve relevant documents, which decreases recall.
- Polysemy: The same word with different meanings (Ex: “apple” as a company vs. a fruit). Polysemy may cause retrieval of erroneous or irrelevant documents, which decreases the precision of retrieval.
- Hypernymy and Hyponymy:
- Preprocessing data
- Finding synonymy, polysemy, hypernymy and hyponymy
- Creating topics
- Create Domain specific Word Embeddings OR Use Transfer Learning
- Word2vec (2013) and SVD?
- CBOW
- Skip-gram with negative sampling
- GloVe (2014)
- SVD, similar to PCA
- Preprocess Query
- Expand the Query
- Find the nearest document to Query
- Cosine
- Word Mover's Distance
- Euclidean Distances
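As a rough sketch of the last step, the snippet below embeds a query and each document by averaging word vectors, then picks the nearest document by cosine similarity and by Euclidean distance. The 3-dimensional `EMBEDDINGS` table is invented for illustration; in practice the vectors would come from a trained Word2vec or GloVe model. Word Mover's Distance is not shown here.

```python
import math

# Hypothetical 3-dimensional embeddings; real ones would come from a trained model.
EMBEDDINGS = {
    "tv": [0.9, 0.1, 0.0],
    "television": [0.85, 0.15, 0.0],
    "sales": [0.1, 0.9, 0.0],
    "forecast": [0.2, 0.8, 0.1],
    "employee": [0.0, 0.1, 0.9],
    "guide": [0.1, 0.0, 0.8],
}

def embed(text):
    """Average the vectors of the known words in the text."""
    vecs = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical sample documents.
docs = {"d1": "tv sales forecast", "d2": "employee guide"}
query_vec = embed("television sales")

# Higher cosine similarity means closer; lower Euclidean distance means closer.
nearest_by_cosine = max(docs, key=lambda d: cosine_similarity(query_vec, embed(docs[d])))
nearest_by_euclid = min(docs, key=lambda d: euclidean_distance(query_vec, embed(docs[d])))
```

Here the query says "television" while `d1` says "tv", yet `d1` is still the nearest document under both measures because the two words have nearby vectors.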
- User inputs query in natural language.
- Use tools like Stanford Parser to identify the noun phrases and other grammatical structure in the query.
- Related synonym sets for the words in the query are obtained from an ontology and the WordNet API.
- Add these words to the original query to form the new query.
- The resulting query is more refined and is sent to the Search API, which fetches the results related to the user query.
- User Query: name of football clubs in EEFA.
- Parsed words for this user query using Stanford Parser:
- WordNet and ontology synonym words: list, soccer
- Expanded Query: Name or list the football Soccer clubs in EEFA
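The expansion step in this example can be sketched as follows. The `SYNONYMS` table is a hypothetical stand-in for the WordNet and ontology lookups, and the parsing step is skipped entirely, so this only illustrates how synonyms are merged into the query.

```python
# Hypothetical synonym table standing in for a WordNet or ontology lookup.
SYNONYMS = {
    "name": ["list"],
    "football": ["soccer"],
}

def expand_query(query, synonyms=SYNONYMS):
    """Append known synonyms after each query word, skipping duplicates."""
    expanded = []
    seen = set()
    for word in query.split():
        for term in [word] + synonyms.get(word.lower(), []):
            if term.lower() not in seen:
                seen.add(term.lower())
                expanded.append(term)
    return " ".join(expanded)

expanded = expand_query("name of football clubs in EEFA")
# expanded == "name list of football soccer clubs in EEFA"
```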
Traditional keyword search cannot tell the difference between "USA players in Catalan basket team" and "Catalan players in USA teams". Such cases are not a problem for semantic search.
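A small sketch of why keyword matching struggles here: once a query is reduced to a bag of words, word order is lost. The two queries below are simplified versions of the example (same words, different order), chosen as an assumption to make the point visible.

```python
from collections import Counter

def bag_of_words(text):
    # Keyword-style representation: word counts only, no word order.
    return Counter(text.lower().split())

q1 = "USA players in Catalan teams"
q2 = "Catalan players in USA teams"

# The queries mean different things but produce the same bag of words.
same_bag = bag_of_words(q1) == bag_of_words(q2)
```

`same_bag` is `True`, so any ranking based purely on these term counts scores every document identically for both queries.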
- t-SNE to reduce 300D embeddings to 2D for visualization
- Transfer Learning and Word Embedding.
- Address Bias in Word Embedding (2016)
- Word2vec (2013) and SVD?
- CBOW
- Skip-gram with negative sampling
- GloVe (2014)
- SVD, similar to PCA
- Attention Model (2014)
- Analogies
- Predicting next word, using sequence modeling
- Fill up the blanks :)
- Sentiment Analysis
- Word embeddings work best when only a small number of labeled training examples is available
- Using the average or sum of all the word embeddings can work. However, it fails on sarcasm examples. Ex: lacking good taste.
- Use RNNs or LSTMs in that scenario
- Machine Translation and Captioning an image
- Greedy Search
- Beam Search
- Speech Recognition
- Topic Modeling
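As an illustration of the analogies use case listed above, the sketch below solves "man is to king as woman is to ?" with vector arithmetic. The 2-dimensional `VECTORS` table is invented so the arithmetic is easy to follow; trained embeddings have hundreds of dimensions and give only approximate answers.

```python
import math

# Hypothetical 2-dimensional embeddings: (royalty, gender).
VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "prince": (0.8, 0.9),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def analogy(a, b, c, vectors=VECTORS):
    """Return the word d such that a is to b as c is to d."""
    # Target vector is b - a + c; the input words are excluded from the answer.
    target = tuple(vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

answer = analogy("man", "king", "woman")
```

With these toy vectors the target is exactly the vector for "queen", so `answer` is `"queen"`.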