/cLCP-mACS

Efficient lightweight strategy to solve the multi-string Average Common Substring (ACS) problem, that consists in the pairwise comparison of a single string against a collection of m strings simultaneously, in order to obtain m ACS induced distances.

Primary LanguageC++

cLCP-mACS

Efficient lightweight strategy to solve the multi-string Average Common Substring (ACS) problem, that consists in the pairwise comparison of a single string against a collection of m strings simultaneously, in order to obtain m ACS induced distances.

This software is an implementation of the algorithm described in:

F. Garofalo, G. Rosone, M. Sciortino and D. Verzotto The colored longest common prefix array computed via sequential scans. Proceedings of String Processing and Information Retrieval - 25th International Symposium, SPIRE 2018, Lecture Notes in Computer Science, Springer 1

Install

git clone https://github.com/giovannarosone/cLCP-mACS.git
cd cLCP-mACS
make all

Run

./cLCP-mACS [-h] [-v] [-p] [-l] [-f input_format] [-Q amount] ref_seq target_seqs ref_color output
Input

cLCP-mACS takes as input the GESA (Generalized Enhanced Suffix Arrays) of a collection of sequences (target collection) and the GESA of one of the sequence in the collection (reference sequence). At the current state, these files are assumed as resulting by eGSA tool computation:

ref_seq       GESA file name of reference sequence (without .gesa extension)
target_seqs   GESA file name of target collection (without .gesa extension)
ref_color     ID of reference sequence in the target collection
Output

cLCP-mACS computes the multi-ACS measure between the reference sequence and the remaining sequences of the target collection. The computation collaterally produces in addition to the the file with the distance values (.acs) two working files containing the array D (.d) and a partial cLCP (.xclcp).

output       Output/Working files name
Options:
-h    help message
-v    verbose textual output

The option -f input_format specifies the format of the input to cLCP-mACS. For now, only -f 1 option is admitted (by default) corresponding to the GESA computed by eGSA.

The option -p tells if the preprocessing step can be skipped. You should use this option only if the target collection has been already processed in a previous execution (this is indicated by the presence of a .info file, and files .bwt,.lcp,.id).

Option -l also skips preprocessing step provided that are available a .lenSeqs.aux file and .bwt,.lcp,.id files (as produced by BCR tool).

The option -Q amount dictates the amount of RAM (in Bytes) available to accomodate partial cLCP page. amount has to be at least 2 x m, where m is the number of sequences in target collection.

Contributors

Fabio Garofalo, University of Palermo

Marinella Sciortino, University of Palermo

Giovanna Rosone, University of Pisa (project lead)

Davide Verzotto, University of Pisa

Citation

If you use cLCP-mACS you could cite the following paper:

@inproceedings{GRSV_Spire18,
author    = {Garofalo, Fabio and Rosone, Giovanna and Sciortino, Marinella and Verzotto, Davide},
title     = {The colored longest common prefix array computed via sequential scans},
booktitle = {String Processing and Information Retrieval - 25th International Symposium,
           {SPIRE} 2018, Proceedings},
pages     = {},
year      = {2018},
series    = {Lecture Notes in Computer Science},
volume    = {},
publisher = {Springer}
}

  1. Supported by the project Italian MIUR-SIR CMACBioSeq ("Combinatorial methods for analysis and compression of biological sequences") grant n.~RBSI146R5L.