Setup ===== Install following gems first. % gem install rspec rbench Benchmark ========= % rake spec Data ==== First, put some data file(utf-8) into data directory as "data.u8". % mkdir data/ % cp <some_dir>/file data/data.u8 Convert the data file to utf-32. (needs "iconv(1)") % rake data This creates following files. data : "data/data.u32" index: "data/data.idx" Format ====== * data (data/data.u32) UTF-32 text file (little endian with BOM(FF FE 00 00)) where each lines are splited by CRLF(0A 00 00 00) * index (data/data.idx) ASCII text file (CRLF) that stores line offset information for above data file. (example) 4 20 32 Test UTF-32 match ============== After creating data.u32 and data.idx, compile the program and execute it. % g++ -Wall -I lib/rk_u32.cpp % ./a.out word The program prints a number of match and writes "result" that contains the matched lines. We assume the search word is EUC-JP and "result" is in UTF-32. To check the result, use a command like this: % iconv -f utf32 -t euc-jp result | lv