/ir-lm

Emacs extension realizing basic mixed language model in information retrieval system.

Primary LanguageEmacs LispGNU General Public License v3.0GPL-3.0

This is an Emacs Lisp extension realizing a simple mixed language
model for information retrieval in paragraphs of files grouped in
multiple directories.
Paragraphs are assumed to be separated by a blank line.  The formula
used for sorting relevance is:
P(w|p) = lambda*P(w|Mp) + (1 - lambda)*P(w|Mc)
where Mp is probabilities model for a paragraph, Mc - probabilities
model for the whole collection and 0 < lambda < 1.
It's only been tested on cvs version of GNU/Emacs 23.1.50 onwards.

Files:
ir-lm.el (elisp source)
optional:
bg-stop-words.txt, stem-rules.txt, stem-rules2.txt, stem-rules3.txt

Adding (require 'ir-lm) to .emacs allows automatic loading of the
extension (after adding to load-path).  Optional files are grouped in
directory "~/.ir/" on posix systems or "C:\\ir\" on windows systems.
Stop word files should be put in subdirectory "stop-words/".  Files
with stemming rules must be in subdirectory "stem-rules/".  These
directories are recursively scanned so any sort of subdirectory
structures would suffice and all files will be used accordingly.  Stop
word files may contain different languages but stemming rules
processing is tuned only for bulgarian at the moment (not really a
problem to make a directory for each needed language with stemming
rules and then adding some specific functions for loading these rules
and stemming for each such subdirectory).

Commands:
ir-lm-index
Creates and loads an index of all files of directory and its
subdirectories.  Index is not saved on disk.  The option for file
types allows multiple glob filters (separated with space) to be
applied to file names, thus indexing just specific files.  The coding
option allows choosing non-default encoding for all files.  The option
for adding to current index determines whether index is freshly loaded
deleting current index or merging current and new indexes thus
allowing search in multiple indexes (treated as one from that moment).
If auxiliary files (stop words, stem rules) have not been loaded -
attempts to load them.

ir-lm-write-index
Writes current index to a chosen file.

ir-lm-load-index
Loads an index from a chosen file.  The option for adding to current
configuration is analogous to the option in ir-lm-index, allowing
merging of multiple indexes.  Index information for files not present
in the file system (as recorded in the index) is not loaded.  When
merging indexes having information for identical (according to path)
files, most recently indexed version for such files is chosen.  If
auxiliary files (stop words, stem rules) have not been loaded -
attempts to load them.

ir-lm-search
Searches indexed paragraphs for words showing a line resumes in a new
buffer and links to the result paragraphs.  When a result is clicked
(enter also suffices), marker is positioned upon the result paragraph
and search terms are coloured.

ir-clear
Freeing index data from memory.  (there probably is a problem with the
way it's done, as Emacs keeps showing as high memory use)

ir-lm-change-lambda
Allowing modifying the lambda search parameter for this language model
(defaults to 0.5)

ir-change-stem-level
Changes stemming level (it only applies for newly indexing and when
auxiliary hashes, stop-words and stem-rules have been cleared or not
yet loaded).

ir-lm-change-max-results
Changes maximum number of search results showed (defaults to 30).

ir-lm-change-min-words
Changes the minimum number of words needed for a paragraph to be
indexed (defaults to 20) on new indexing.

ir-lm
A convenient interface for the above commands as well as links to all
files in current index.