Using ISM with large sequences
vrodriguezf opened this issue · 4 comments
Hi,
I am trying to use ISM for mining compressing sequential patterns in my sequence database. Unfortunately, I cannot manage to make the algorithm finish successfully, it gets stuck forever. I think that the problem is that the sequences are very large in comparison with the datasets I see in the datasets
folder.
I am analyzing a sequence database composed of 1134 sequences. The total number of symbols is 74, and the average number of distinct symbols per sequence is 35.69, with a standard deviation of 2.65. The average length is 1466.60, with an standard deviation of 2172.60. The maximum length of a sequence is 9980.
Even with an extract of 100 sequences, I cannot make ISM finish. Here is a link to the input file I am using, in the SPMF format:
link to the input file
I have tried to reduce the max. numer of iterations and the max. number of structure steps, but it does not work.
Maybe a solution could be splitting the large sequences into smaller sequences, but that could break some patterns and make the results inconsistent.
Any help here?
Many thanks for your great work!
Hello,
Many thanks for your question!
The problem is indeed that your sequences are very long, too long for ISM to handle in a reasonable amount of time. The reason for this is the way ISM works: it essentially solves a set cover problem for each database sequence (approximately using a greedy algorithm) to determine whether candidate sequential patterns would make good compressing patterns. For very long database sequences this set cover problem becomes prohibitively expensive and this is unfortunately a known shortcoming of ISM.
My advice would therefore be to do as you suggest and split the long sequences into a database of shorter ones. This is commonly done when mining protein sequences and you might be able to find some pointers on this in that literature (I am not very familiar with it myself). Of course the danger with this is, as you correctly point out, that you may miss some patterns, particularly the longer ones. Having said that, a colleague of mine managed to successfully apply a similar technique to mine closed frequent sequences from a protein database, so you may be able to get some results with this approach but your mileage may vary.
Hi,
Thank you so much for your prompt response!
Knowing that it is not an issue of myself using ISM wrongly, I think I will proceed to split the sequence database into smaller sequences. Do you have any idea, from your experience, about a good average sequence size to make ISM work properly?
I've only used ISM with the sequences in the datasets
folder, so I would suggest starting with average sequence sizes corresponding to those datasets to begin with.
Thank you!
I will make a preliminary study how splitting the sequences affect the final result of the algorithm in my dataset.
Best!