RabinKarp

Setup
=====

  Install the following gems first.

    % gem install rspec rbench


Benchmark
=========

    % rake spec


Data
====

  First, put a UTF-8 data file into the data directory as "data.u8".

    % mkdir data/
    % cp <some_dir>/file data/data.u8

  Convert the data file to UTF-32 (requires "iconv(1)").

    % rake data

  This creates the following files.

    data : "data/data.u32"
    index: "data/data.idx"

Format
======
  * data (data/data.u32)

    UTF-32 text file (little endian, with BOM (FF FE 00 00))
    where lines are separated by LF (0A 00 00 00)

  * index (data/data.idx)

    ASCII text file (CRLF)
    that stores line offset information for the data file above.

    (example)
       4
       20
       32
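
  The two formats above can be sketched in Ruby as follows. This is only an
  illustration, not the code behind "rake data": the helper names
  build_u32_and_index and line_at are hypothetical, and it assumes that each
  index entry is the byte offset at which a line starts in data.u32 (so the
  first entry is 4, just past the BOM).

```ruby
# Hypothetical helpers, not part of this repository.
# Assumption: each index entry is the byte offset where a line starts.

BOM = "\xFF\xFE\x00\x00".b # UTF-32 little-endian byte order mark

# Build the UTF-32LE byte string and the list of line-start offsets
# from UTF-8 text whose lines end with LF.
def build_u32_and_index(utf8_text)
  data = BOM.dup
  offsets = []
  utf8_text.each_line do |line|
    offsets << data.bytesize            # where this line begins
    data << line.encode("UTF-32LE").b   # 4 bytes per code point, no BOM
  end
  [data, offsets]
end

# Fetch line n (0-based) from the UTF-32LE bytes using the offsets.
def line_at(data, offsets, n)
  start = offsets[n]
  stop  = offsets[n + 1] || data.bytesize
  data.byteslice(start, stop - start)
      .force_encoding("UTF-32LE")
      .encode("UTF-8")
      .chomp
end

data, offsets = build_u32_and_index("abc\nde\nf\n")
puts offsets                     # one offset per line
puts line_at(data, offsets, 1)   # second line, decoded back to UTF-8
```

  With this input the offsets happen to be 4, 20 and 32, the same values as
  in the example above: "abc" plus LF takes 16 bytes, "de" plus LF takes 12.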


Test UTF-32 match
=================

  After creating data.u32 and data.idx, compile the program and execute it.

    % g++ -Wall lib/rk_u32.cpp
    % ./a.out word
 
  The program prints the number of matches and writes a file "result" that
  contains the matched lines. We assume the search word is in EUC-JP and
  "result" is in UTF-32. To check the result, use a command like this:

    % iconv -f utf32 -t euc-jp result | lv
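
  For reference, the Rabin-Karp search itself can be sketched in Ruby over
  code points, which corresponds to comparing one UTF-32 unit at a time.
  This is a minimal illustration and not the implementation in
  lib/rk_u32.cpp: the method name rabin_karp and the constants BASE and MOD
  are assumptions chosen for the example.

```ruby
# Minimal Rabin-Karp sketch over Unicode code points.
# BASE and MOD are illustrative choices, not values from this repository.
BASE = 257
MOD  = 1_000_000_007

# Returns the code point indexes in text where pattern begins.
def rabin_karp(text, pattern)
  t   = text.codepoints
  pat = pattern.codepoints
  m   = pat.length
  return [] if m.zero? || m > t.length

  high = BASE.pow(m - 1, MOD) # weight of the leftmost code point in a window
  pat_hash = 0
  win_hash = 0
  m.times do |i|
    pat_hash = (pat_hash * BASE + pat[i]) % MOD
    win_hash = (win_hash * BASE + t[i]) % MOD
  end

  matches = []
  (0..t.length - m).each do |i|
    # Compare the code points only when the hashes agree, to rule out collisions.
    matches << i if win_hash == pat_hash && t[i, m] == pat
    next if i + m >= t.length
    # Roll the window: drop t[i], shift, then append t[i + m].
    win_hash = ((win_hash - t[i] * high) * BASE + t[i + m]) % MOD
  end
  matches
end

p rabin_karp("abracadabra", "abra")
p rabin_karp("日本語の本", "本")
```

  A search word given in EUC-JP would first have to be converted to the
  encoding of the text, for example with String#encode, before hashing.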