=========== Pattern HMM =========== This Pattern HMM package implements a profile hidden Markov model useful for identifying patterns of symbols in a sequence of strings. Installation ============ NOTE: This package is not set up to be installed by cloning or pulling from GitHub. Instead, download the latest release file. This package is small, containing several Python files and one C-extension. Installation is done using the standard ``distutils`` software, and so should be relatively straight forward. Unpack the distribution file (either the .tar.gz or the .zip file), and change directory into it. Then install the package with the command: ``python setup.py install`` You may need super-user privileges to create the necessary directories, in which case the command would be: ``sudo python setup.py install`` On Windows the command should be just: ``setup.py install`` This should compile the C-extension and move all of the relevant files to your default location for third-party Python modules: * On Unix this should be ``/usr/local/lib/pythonX.Y/site-packages`` or ``/usr/local/lib/pythonX.Y/dist-packages``. * On MacOSX this should be ``/Library/Python/X.Y/site-packages``. * On Windows this should be ``C:\PythonXY\Lib\site-packages``. Once this is finished, you may delete the distribution folder. Usage ===== As an example, let us say that we are searching for the pattern "ABCDE" in some sequence of characters. This pattern must be defined using a model object, which itself is defined by 2 dictionaries, one describing the emission probabilities of the match states, and the other describing the transition probabilities between model states. This package includes a function that creates a template model file to fill out. It is meant to be accessed from the Python interpreter as:: >>> import patternHmm >>> patternHmm.generate_model_file(5, 'test_model.py') Here, the first argument is an integer, indicating how long your pattern is (5 in our example). The second argument is a path describing where to save the template file. If you open ``test_model.py``, you can see the default parameters already filled out. These can be changed as desired to fine tune the model, but should work just fine as they are for this example. If you were searching for some other pattern, you would probably modify the ``matchEmissions`` dictionary, substituting your particular symbols (letters or numbers) instead of just the ascending alphabet. The typical way of using this model file is to then create a new Python file, the run file. This file should contain the sequence of interest, import the model file, and then use one of the model functions to find the pattern matches within the sequence. It might look something like:: import test_model model = test_model.model sequence = 'efABgCDEhiACjkDEfg' model.print_matches(sequence) This will output the 2 matches that the model found within the sequence, aligned to the sequence of predicted match and insert states. The other common model method is ``model.print_path(sequence)``, which just prints the full alignment of the sequence and the predicted sequence of states that might have generated it. If this package is being used as part of some other program, there are versions of those 2 methods that don't print anything, but instead just return the information: ``model.find_matches(sequence)`` and ``model.find_path(sequence)``. For a more complete reference, see the documentation in the ``patterHmm/__init__.py`` file. Usage Notes =========== * The symbols defined in the ``matchEmissions`` dictionary in the model file can be any length of characters, but may only contain alpha-numeric characters. * The sequence can be either a string or a list of symbols. However, if any of the symbols in your pattern are longer than one character, the sequence must be a list. * A match state can emit any number of possible symbols, not just one. For example, if we wanted the third symbol in our pattern to be either 'C' or 'Z', the third entry in the ``matchEmissions`` dictionary should be modified to ``'M3': {'C':1.0, 'Z':1.0},``. * If the most important part of our pattern is the 'B', we might attribute it more weight by modifying the second entry in the ``matchEmissions`` dictionary to ``'M2': {'B':1.5},``. * If the C extension was compiled but is not working correctly, or if you wish to use the Python implementation for some reason (described in the section below), it can be done by changing the first line of your model file to read: ``from patternHmm.pyProfileHmm import Hmm``. Implementation ============== This package implements two versions of the Viterbi algorithm to predict the most likely sequence of states that might generate some given sequence. The default version performs the calculations in C (Viterbi.c), but if there is a problem compiling the extension into a .so format, the program will automatically use the Python implementation (pyProfileHmm.py) instead. This version requires numpy to be installed, and is ~200 x slower than the C version, in addition to using more memory. If the C implementation is being used, there is no requirement that numpy be installed on the machine.
dave-the-scientist/patternHmm
Matching patterns of symbols using a profile Hidden Markov Model
PythonGPL-3.0