ISM: Interesting Sequence Miner

ISM is a novel algorithm that mines the subsequences that are most interesting under a probablistic model of a sequence database. Our model is able to efficiently infer interesting subsequences directly from the database.

This is an implementation of the sequence miner from our paper:
A Subsequence Interleaving Model for Sequential Pattern Mining
J. Fowkes and C. Sutton. KDD 2016.

Installation

Installing in Eclipse

Simply import as a maven project into Eclipse using the File -> Import... menu option (note that this requires m2eclipse).

It's also possible to export a runnable jar from Eclipse using the File -> Export... menu option.

Compiling a Runnable Jar

To compile a standalone runnable jar, simply run

mvn package

in the top-level directory (note that this requires maven). This will create the standalone runnable jar sequence-mining-1.0.jar in the sequence-mining/target subdirectory. The main class is sequencemining.main.SequenceMining (see below).

Running ISM

ISM uses a Bayesian Network Model to determine which subsequences are the most interesting in a given dataset.

Mining Interesting Sequences

Main class sequencemining.main.SequencesMining mines subsequences from a specified sequences database file. It has the following command line options:

-f database file to mine (in SPMF format)
-i max. no. iterations
-s max. no. structure steps
-r max. runtime (min)
-l log level (INFO/FINE/FINER/FINEST)
-v print to console instead of log file

See the individual file javadocs in sequencemining.main.SequenceMining for information on the Java interface. In Eclipse you can set command line arguments for the ISM interface using the Run Configurations... menu option.

Example Usage

A complete example using the command line interface on a runnable jar. We can mine the provided example dataset example.dat as follows:

$ java -jar sequence-mining/target/sequence-mining-1.0.jar -i 100 -f example.dat -v

which will output to the console. Omitting the -v flag will redirect output to a log-file in /tmp/.

Input/Output Formats

Input Format

ISM takes as input a sequence database file in SPMF format. The SPMF format is very simple: each line of the input file represents a database sequence and each sequence is a list of items, represented by positive integers, separated by -1 and ending with -2. For example, the first few lines (database sequences) from example.dat are:

1 -1 2 -1 3 -1 4 -1 -2
3 -1 5 -1 6 -1 4 -1 -2
3 -1 4 -1 -2
3 -1 5 -1 6 -1 7 -1 8 -1 4 -1 -2
3 -1 4 -1 -2

Note that any other item formats (e.g. words for text corpora) need to be manually mapped to (and from) positive integers by means of a dictionary.

Output Format

ISM outputs a list of interesting sequences, one sequence per line, ordered first by their interestingness (given in the 'int' column) followed by their probability (given in the 'prob' column). For example, the first few lines of output for the usage example above are:

============= INTERESTING SEQUENCES =============
[3] prob: 1.00000   int: 1.00000 
[4] prob: 1.00000   int: 1.00000 
[7, 8]  prob: 0.47500   int: 1.00000 
[5, 6]  prob: 0.32000   int: 1.00000 
[9] prob: 0.10500   int: 1.00000 
[12]    prob: 0.02000   int: 1.00000 
[16]    prob: 0.01000   int: 1.00000 
[9, 10, 5, 6, 10, 6, 9, 10, 5, 6]   prob: 0.01000   int: 1.00000 
[11, 7, 8, 5, 6]    prob: 0.00500   int: 1.00000 
[13, 14, 15, 13, 14, 5, 6]  prob: 0.00500   int: 1.00000

See the accompanying paper for details of how to interpret 'interestingness' and 'probability' under ISM's probabilistic model.

Datasets

The datasets used in the paper are available in the datasets subdirectory. All datasets are in SPMF format (see above). The classification datasets additionally include the class labels for each transaction in a .lab file.

Bugs

Please report any bugs using GitHub's issue tracker.

License

This algorithm is released under the GNU GPLv3 license. Other licenses are available on request.

jinfuchen/sequence-mining