The program lsi.py implements a simple latent semantic analysis (LSA) engine using singular value decomposition (SVD). With small changes, the program can also be used to try out non-negative matrix factorization or vector quantization.
It implements the following:
- Given a document title, it outputs k similar documents
- Given any word, it outputs k related words from all the documents. If the word occurs in none of the documents, it outputs k random words.
- Given a query, it outputs k relevant documents for the query.
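The three lookups above can be illustrated with a minimal sketch of LSA on a toy term-document matrix. This is not the code in lsi.py: the function names (`lsa_embed`, `top_k_similar`, `rank_by_cosine`), the use of cosine similarity, and the query fold-in by projecting onto the left singular vectors are all assumptions made for illustration.

```python
import numpy as np

def lsa_embed(term_doc, z):
    """Truncated SVD of a (n_terms, n_docs) matrix; keeps at most z dimensions."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    z = min(z, len(s))
    U_z, s_z = U[:, :z], s[:z]
    term_vecs = U_z * s_z           # one row per term, shape (n_terms, z)
    doc_vecs = Vt[:z, :].T * s_z    # one row per document, shape (n_docs, z)
    return term_vecs, doc_vecs, U_z

def top_k_similar(vectors, index, k):
    """Indices of the k rows most cosine-similar to row `index`, excluding itself."""
    norms = np.linalg.norm(vectors, axis=1)
    norms[norms == 0] = 1.0         # avoid dividing by zero for empty rows
    unit = vectors / norms[:, None]
    sims = unit @ unit[index]
    sims[index] = -np.inf           # never return the item itself
    return np.argsort(-sims)[:k].tolist()

def rank_by_cosine(query_vec, doc_vecs, k):
    """Indices of the k documents most cosine-similar to a projected query."""
    q_norm = np.linalg.norm(query_vec) or 1.0
    d_norms = np.linalg.norm(doc_vecs, axis=1)
    d_norms[d_norms == 0] = 1.0
    sims = (doc_vecs @ query_vec) / (d_norms * q_norm)
    return np.argsort(-sims)[:k].tolist()

# Toy data (hypothetical): rows are terms, columns are documents.
terms = ["cat", "dog", "stock", "bond"]
A = np.array([[2, 1, 0],
              [1, 2, 0],
              [0, 0, 3],
              [0, 0, 1]], dtype=float)

term_vecs, doc_vecs, U_z = lsa_embed(A, z=2)
similar_docs = top_k_similar(doc_vecs, index=0, k=1)     # documents like doc 0
related_terms = top_k_similar(term_vecs, index=2, k=1)   # terms like "stock"
query_counts = np.array([0, 0, 1, 0], dtype=float)       # the query "stock"
relevant_docs = rank_by_cosine(query_counts @ U_z, doc_vecs, k=1)
```

Projecting a query with `query_counts @ U_z` puts it in the same coordinates as `doc_vecs`, because a document's own count column maps to its row of `doc_vecs` under the same projection.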
The code has been optimised to work well on large inputs too. Additional steps (removing stopwords, tf-idf weighting, and normalising) are implemented in the same file but are commented out. Uncomment the ones you need; only slight modifications are required to get them working.
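As a rough picture of what these optional preprocessing steps do, here is a small sketch. It is not the commented-out code from lsi.py: the stopword list, the function names, the `log(N / df)` form of idf, and the choice of L2-normalising each document column are assumptions for illustration.

```python
import numpy as np

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "is"}

def tokenize(text, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [word for word in text.lower().split() if word not in stopwords]

def tfidf(counts):
    """Weight a (n_terms, n_docs) count matrix by idf = log(n_docs / df)."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    doc_freq = np.count_nonzero(counts, axis=1)        # documents containing each term
    idf = np.log(n_docs / np.maximum(doc_freq, 1))     # guard against df == 0
    return counts * idf[:, None]

def normalise_columns(matrix):
    """Scale each document column to unit L2 length (zero columns stay zero)."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0
    return matrix / norms

tokens = tokenize("The cat and the dog")
weighted = normalise_columns(tfidf([[1, 1], [2, 0]]))
```

With this idf form, a term that appears in every document gets weight zero, so it no longer influences the decomposition.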
Requires scipy, numpy, and a few other basic Python libraries. To avoid setting up the environment by hand, use the requirements.txt file to set up a Python virtual environment:
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
To deactivate the virtualenv use: deactivate
lsi.py can be run as follows:
python lsi.py -z 200 -k 10 --dir Directory --doc_in <name of input document file> --doc_out <name of output document file to be generated by code> --term_in <name of input term file> --term_out <name of output term file to be generated by code> --query_in <name of input query file> --query_out <name of output query file to be generated by code>
where
-z: Dimensionality of lower dimensional space
-k: Number of similar terms/documents to be returned
--dir: Directory containing input documents
--doc_in: Input file containing a list of document titles (one per line) for which k similar documents are to be returned.
--doc_out: Each line of this output file will have the titles of k documents (separated by ';<tab>', i.e., a semicolon followed by a tab) that are similar to the document on the corresponding line of doc_in
--term_in: Input file containing a list of words (one per line) for which k similar words/terms are to be returned.
--term_out: Each line of this output file will have k words (separated by ';<tab>', i.e., a semicolon followed by a tab) that are similar to the word on the corresponding line of term_in
--query_in: Input file containing a list of queries (one per line) for which k relevant documents are to be returned.
--query_out: Each line of this output file will have the titles of k documents (separated by ';<tab>', i.e., a semicolon followed by a tab) that are relevant to the query on the corresponding line of query_in
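The options and the output format described above could be handled along these lines. This is a sketch only, not the parser in lsi.py: `build_parser` and `format_output_line` are hypothetical names, and treating the input/output file options as optional is an assumption.

```python
import argparse

def build_parser():
    """Argument parser mirroring the options listed in this README."""
    parser = argparse.ArgumentParser(description="Simple LSA engine (sketch)")
    parser.add_argument("-z", type=int, required=True,
                        help="dimensionality of the lower-dimensional space")
    parser.add_argument("-k", type=int, required=True,
                        help="number of similar terms/documents to return")
    parser.add_argument("--dir", required=True,
                        help="directory containing the input documents")
    # Assumption: each in/out pair may be omitted if that lookup is not needed.
    for name in ("doc", "term", "query"):
        parser.add_argument(f"--{name}_in", help=f"input {name} file, one entry per line")
        parser.add_argument(f"--{name}_out", help=f"output {name} file to generate")
    return parser

def format_output_line(items):
    """Join k results with a semicolon followed by a tab, as the output files expect."""
    return ";\t".join(items)

example_args = build_parser().parse_args(
    ["-z", "200", "-k", "10", "--dir", "Directory",
     "--doc_in", "docs.txt", "--doc_out", "similar_docs.txt"]
)
example_line = format_output_line(["first title", "second title"])
```

Writing one `format_output_line` result per input line keeps each output file aligned line by line with its input file.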
- Fork the repository
- Create a new branch named after an abbreviation of the feature
- Implement the new feature
- Send a pull request
MIT