Latent-Semantic-Indexing

A vanilla implementation of a basic search engine that uses latent semantic indexing (LSI) to index documents and retrieve results for search queries. The repository also includes the document set used for experimentation, the benchmark queries, and the results obtained on them.

Primary Language: Python

Latent Semantic Analysis

Introduction

The program lsi.py implements a simple latent semantic analysis engine using singular value decomposition (SVD). With small changes, the program can be adapted to try out non-negative matrix factorization or vector quantization instead.
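As a rough illustration of the SVD step, the sketch below builds a term-document count matrix from a toy corpus and keeps only the z largest singular triplets. It is not taken from lsi.py: the function names `build_term_document_matrix` and `truncated_svd`, the whitespace tokenisation, and the sample documents are all assumptions made for this example.

```python
import numpy as np

def build_term_document_matrix(docs):
    """Return (matrix, vocab): one row per term, one column per document."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for word in doc.lower().split():
            matrix[index[word], j] += 1.0  # raw term count
    return matrix, vocab

def truncated_svd(matrix, z):
    """Keep the z largest singular values and their singular vectors."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    z = min(z, len(s))  # z cannot exceed the number of singular values
    return u[:, :z], s[:z], vt[:z, :]

# Toy corpus, invented for illustration only.
docs = [
    "human machine interface",
    "user interface system",
    "graph of trees",
    "trees and graph minors",
]
A, vocab = build_term_document_matrix(docs)
U, S, Vt = truncated_svd(A, z=2)

term_vectors = U * S      # one z-dimensional row per term
doc_vectors = Vt.T * S    # one z-dimensional row per document
```

Here z plays the role of the -z option described further down. For a large, sparse matrix, a sparse solver such as `scipy.sparse.linalg.svds` is one possible alternative to the dense `np.linalg.svd` call used above.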

It implements the following:

  • Given a document title, it outputs k similar documents.
  • Given any word, it outputs k related words from all the documents. If the word occurs in none of the documents, it outputs k random words.
  • Given a query, it outputs k relevant documents for the query.
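The three lookups listed above can be sketched with cosine similarity in the reduced space. This is a minimal sketch under stated assumptions, not the code in lsi.py: the helper names (`top_k`, `similar_documents`, `related_words`, `relevant_documents`), the toy titles and texts, the use of cosine similarity, and the standard fold-in projection of the query (query counts times U, divided by the singular values) are all choices made for this example.

```python
import random
import numpy as np

# Toy corpus; titles and texts are invented for illustration.
titles = ["hci", "ui", "graphs", "minors"]
texts = ["human machine interface", "user interface system",
         "graph of trees", "trees and graph minors"]

vocab = sorted({w for t in texts for w in t.split()})
word_index = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), len(texts)))
for j, text in enumerate(texts):
    for w in text.split():
        A[word_index[w], j] += 1.0

z = 2  # dimensionality of the reduced space
U, S, Vt = np.linalg.svd(A, full_matrices=False)
U, S, V = U[:, :z], S[:z], Vt[:z, :].T  # rows of U: terms, rows of V: documents

def top_k(vectors, target, k, exclude=None):
    """Indices of the k rows most cosine-similar to target."""
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(target)
    sims = (vectors @ target) / np.where(norms == 0, 1.0, norms)
    ranked = [i for i in np.argsort(-sims).tolist() if i != exclude]
    return ranked[:k]

def similar_documents(title, k):
    j = titles.index(title)
    return [titles[i] for i in top_k(V, V[j], k, exclude=j)]

def related_words(word, k):
    if word not in word_index:
        # Unknown word: fall back to k random words, as described above.
        return random.sample(vocab, min(k, len(vocab)))
    i = word_index[word]
    return [vocab[t] for t in top_k(U, U[i], k, exclude=i)]

def relevant_documents(query, k):
    q = np.zeros(len(vocab))
    for w in query.lower().split():
        if w in word_index:
            q[word_index[w]] += 1.0
    q_hat = (q @ U) / S  # fold the query into the reduced space
    return [titles[i] for i in top_k(V, q_hat, k)]

print(similar_documents("graphs", 1))
print(related_words("graph", 2))
print(relevant_documents("graph trees", 2))
```

Excluding the item itself from its own result list is a design choice of this sketch; whether lsi.py does the same is not stated here.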

The code has been optimised to work well on large inputs as well. The additional steps of stopword removal, tf-idf weighting and normalisation are implemented in the same file but are commented out. A user can uncomment the required parts and get them working with only slight modifications.
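For readers unfamiliar with these optional steps, here is one common way they are done. This is a hedged sketch, not the commented-out code in lsi.py: the tiny `STOPWORDS` set, the `log(N / df)` form of the inverse document frequency, and the per-document L2 normalisation are assumptions chosen for this example.

```python
import numpy as np

# Illustrative stopword list only; a real list would be much longer.
STOPWORDS = {"a", "an", "and", "of", "the", "to", "in"}

def tokenize(text, remove_stopwords=True):
    """Lowercase, split on whitespace, and optionally drop stopwords."""
    words = text.lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    return words

def tfidf(counts):
    """Weight a terms x documents count matrix by inverse document frequency."""
    n_docs = counts.shape[1]
    doc_freq = np.count_nonzero(counts, axis=1)     # documents containing each term
    idf = np.log(n_docs / np.maximum(doc_freq, 1))  # guard against division by zero
    return counts * idf[:, np.newaxis]

def normalise_columns(matrix):
    """Scale each document column to unit length; all-zero columns stay zero."""
    norms = np.linalg.norm(matrix, axis=0)
    return matrix / np.where(norms == 0, 1.0, norms)

counts = np.array([[1.0, 1.0],   # term present in both documents
                   [2.0, 0.0]])  # term present in the first document only
weighted = normalise_columns(tfidf(counts))
```

With this weighting, a term that occurs in every document gets weight zero, so it no longer influences the decomposition.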

Setting up the environment

Requires scipy, numpy and a few other basic Python libraries. To avoid setting these up by hand, use the requirements.txt file to set up a Python virtual environment:

  • virtualenv venv
  • source venv/bin/activate
  • pip install -r requirements.txt

To deactivate the virtualenv use: deactivate

Running the latent semantic search engine

lsi.py can be run as follows:

python lsi.py -z 200 -k 10 --dir Directory --doc_in <name of input document file> --doc_out <name of output document file to be generated by code> --term_in <name of input term file> --term_out <name of output term file to be generated by code> --query_in <name of input query file> --query_out <name of output query file to be generated by code>

where
-z: Dimensionality of the lower-dimensional space
-k: Number of similar terms/documents to be returned
--dir: Directory containing the input documents
--doc_in: Input file containing a list of document titles (one per line) for which k similar documents are to be returned.
--doc_out: Each line of this output file will have the titles of k documents (separated by ';<tab>', i.e. a semicolon followed by a tab) that are similar to the document on the corresponding line of doc_in
--term_in: Input file containing a list of words (one per line) for which k similar words/terms are to be returned.
--term_out: Each line of this output file will have k words (separated by ';<tab>', i.e. a semicolon followed by a tab) that are similar to the word on the corresponding line of term_in
--query_in: Input file containing a list of queries (one per line) for which k relevant documents are to be returned.
--query_out: Each line of this output file will have the titles of k documents (separated by ';<tab>', i.e. a semicolon followed by a tab) that are relevant to the query on the corresponding line of query_in
Note: The documents in the directory must be numbered 1, 2, 3, 4, ..., n
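
To make the option list and the output format concrete, the sketch below mirrors the flags with argparse and joins k results with a semicolon followed by a tab. It is an illustration only and may differ from how lsi.py actually parses its arguments: `build_parser` and `format_result_line` are hypothetical names, the argument types are assumed, and it is assumed that no separator follows the last item on a line.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the options documented above."""
    parser = argparse.ArgumentParser(description="Latent semantic search engine")
    parser.add_argument("-z", type=int, help="dimensionality of the lower-dimensional space")
    parser.add_argument("-k", type=int, help="number of similar terms/documents to return")
    parser.add_argument("--dir", help="directory containing the input documents")
    for name in ("doc", "term", "query"):
        parser.add_argument(f"--{name}_in", help=f"input {name} file, one entry per line")
        parser.add_argument(f"--{name}_out", help=f"output {name} file to be generated")
    return parser

def format_result_line(items):
    """Join k results with ';<tab>' (assumed: no trailing separator)."""
    return ";\t".join(items)

args = build_parser().parse_args(
    ["-z", "200", "-k", "10", "--dir", "Directory", "--doc_in", "docs.txt"]
)
line = format_result_line(["first title", "second title", "third title"])
```

The file name docs.txt and the three titles are placeholders for this example.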

Contributing

  • Make a fork
  • Branch out, naming the new branch with an abbreviation of the feature
  • Implement the new feature
  • Send a pull request

License

MIT