disco-slct is a mapreduce implementation of SLCT. According to the SLCT website:
SLCT is a tool that was designed to find clusters in logfile(s), so that each cluster corresponds to a certain line pattern that occurs frequently enough.
Examples of the clusters that SLCT, and thus disco-slct, is able to detect:
Dec 18 * myhost.mydomain sshd[*]: log: Connection from * port *
Dec 18 * myhost.mydomain sshd[*]: log: Password authentication for * accepted.
With the help of SLCT, one can quickly build a model of logfile(s), and also identify rare lines that do not fit the model (and are possibly anomalous). disco-slct uses Disco for it's backend.
First, you obviously need to install Disco. Instructions on how to do this can be found in the Disco installation guide.
Optionally you'd want to push your logfiles to DDFS if you have a lot of them. Information on how to do this can be found in the Disco tutorial.
Next, you choose a threshold, which is the mininum support value for each log pattern. This is the number of lines that each line outputted by disco-slct will match in your log files.
Next, execute:
$ python ./dslct.py -s <THRESHOLD> <DISCO_URL_TO_YOUR_LOGFILE>
You can always issue
$ python ./dslct.py --help
to see exact parameters to disco-slct.