/lzbg

Automatically exported from code.google.com/p/lzbg

Primary LanguageCGNU General Public License v2.0GPL-2.0

-------------------------------------------------------------

This repository contains implementations of a new, simple, 
and fast approach for LZ77 factorization.

-------------------------------------------------------------
Simple explanation of approach
-------------------------------------------------------------
Our approach basically computes the PSV and NSV values
based on the suffix array SA, where

  PSV[i] = max { j < i | SA[j] < SA[i] }
  NSV[i] = min { j > i | SA[j] < SA[i] }

which can be computed by a process called peak elimination.
Then,

  LPF[i] = max { lcp(T[j:N], T[SA[i]:N]) | j < SA[i] }
         = max { lcp(T[SA[PSV[i]]:N], T[SA[i]:N]),
                 lcp(T[SA[NSV[i]]:N], T[SA[i]:N]) }.

Where, by definition, an LZ77 factor starting at position 
SA[i] in the text will have length LPF[i].
(above description is in lex order, but text order is 
 considered depending on the variant).

Most previous algorithms compute LPFs for all 1 \leq i \leq N.
However, our algorithm avoids computing all of these values,
but rather computes the LZ77 parse directly from
PSV and NSV, making the algorithm simpler and faster.

There are 6 variants: BGS, BGL, BGT and iBGS, iBGL, and iBGT.
All run in linear time, given the suffix array of the string.
Working memory is as follows:

BGS, iBGS: 17*N Bytes + stack
BGL, iBGL: 17*N Bytes
BGT, iBGT: 13*N Bytes

-------------------------------------------------------------
Files
-------------------------------------------------------------

Each of the above variants are coded in the following files:

 bgCommon.hpp: header file for common utility functions
 bgCommon.cpp: implementation of common utility functions

 bgsMain.cpp: peak elimination in lex order using stack
ibgsMain.cpp: peak elimination in lex order using stack, interleaving PSV,NSV

 bglMain.cpp: peak elimination in lex order
ibglMain.cpp: peak elimination in lex order, interleaving PSV,NSV

 bgtMain.cpp: peak elimination in text order with the help of \Phi
ibgtMain.cpp: peak elimination in text order with the help of \Phi, interleaving PSV,NSV

where interleaving PSV, NSV means that they are stored in a 
single array PNSV of length 2*N and PNSV[2*i] = PSV[i] and PNSV[2*i+1] = NSV[i].

Computational experiments so far have shown that iBGS seems to be
fastest except for extremely repetitive data, where iBGT seems to be fastest.

We also provide implementation for the algorithm shown in:
 E. Ohlebusch & S. Gog, "Lempel-Ziv Factorization Revisited",
 In Proc. CPM 2011, LNCS 6661:15-26, 2011.
(our implementation is based on the pseudo-code in the paper,
 and the performance is basically equal to the implementation
 that Dr. Simon Gog shared with us.)

These are implemented in:
 ogMain.cpp: LZ_OG
iogMain.cpp: LZ_OG with interleaving LPS and PrevOcc
LZ_OG requires 13*N Bytes of working memory.

The files:
  divsufsort.h
  divsufsort.c
are from libdivsufsort-lite by Yuta Mori, for computing the
suffix array of a string.

-------------------------------------------------------------
Usage:
-------------------------------------------------------------
The programs can be compiled via the SCons build tool:

  http://scons.org

just type 'scons' in the directory. 
The SConstruct file should be edited if you need to modify
options passed to the compiler.

The following executables will be produced:

lzBGL
lzBGS
lzBGT
lzOG
lziBGL
lziBGS
lziBGT
lziOG

All usage is the same for all the programs:

Usage  : ./lzBGL [options]
Options: 
  -f iFile : file to process
  -x       : use iFile + '.sa' for suffix array cache
  -g       : check if resulting factorization produces input string

if -x is specified, the program will also look for a file with 
extension '.sa' appended to the input filename, and use it as the 
suffix array if it exists, or create/overwrite the file if it does 
not exist or seems to be out of date.

The LZ factorization is returned in:
   std::vector<std::pair<int,int> > lz;
which is a sequence of
  (length of factor, previous occurrence) if LPF > 0, or (0, T[p]).

Currently the programs output only simple statistics based on lz.

-------------------------------------------------------------
Authors:
 Hideo Bannai
 Keisuke Goto
-------------------------------------------------------------