CIndex[1] is a compressed index for FASTQ files. It uses the Burrows-Wheeler transform[2] and the wavelet tree[3], combined with hybrid encoding and succinct data structures, to achieve minimal space usage while enabling fast retrieval on the compressed FASTQ files.
CIndex is written in C++11 and requires g++ 5.4 or above with SSE4.2 support. The reference environment is 64-bit Ubuntu 16.04.1 LTS.
$ git clone https://github.com/Hongweihuo-Lab/CIndex.git
$ cd CIndex
$ make
The argument given after -z selects the operating mode.
The operating modes are build, cr, lr, er, and efq, which correspond to compressed index construction, read counting, read locating, read extraction, and FASTQ extraction, respectively.
Create the compressed index of the FASTQ file.
The command to create the compressed index for the FASTQ file is:
Command | Description |
---|---|
./CIndex -f fastqpath -z build | -f fastqpath is the path of the FASTQ file to be compressed; -z build means to build the compressed index. |
The index file will be generated under the folder corresponding to the FASTQ file after the operation is completed.
CIndex supports four queries: CountR (cr), LocateR (lr), ExtractR (er), extractFASTQ (efq), listed as follows:
Command | Description |
---|---|
./CIndex -f fastqpath -z cr | Input a pattern P; it returns the number of occurrences of P in the read strings R. |
./CIndex -f fastqpath -z lr | Input a pattern P; it reports the line numbers of all reads that contain an occurrence of P. |
./CIndex -f fastqpath -z er | Input a read line number; it extracts the read associated with that line number. |
./CIndex -f fastqpath -z efq | Input P, it extracts the collection of records that contain P. |
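
To make the expected results of the four queries concrete, here is a minimal Python sketch that answers them by naive scanning of a toy FASTQ text. It is not how CIndex works internally (CIndex answers these queries on the compressed index), and all function names, the toy records, and the numbering convention are assumptions for illustration: reads are numbered from 1 in file order, and CountR counts every occurrence, including overlapping ones.

```python
from typing import List, Tuple

# One FASTQ record: (header, read, separator, quality).
Record = Tuple[str, str, str, str]


def parse_fastq(text: str) -> List[Record]:
    """Split FASTQ text into records of four lines each."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    return [tuple(lines[i:i + 4]) for i in range(0, len(lines), 4)]


def count_occurrences(read: str, pattern: str) -> int:
    """Count occurrences of pattern in read, overlapping ones included."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    last_start = len(read) - len(pattern)
    return sum(read.startswith(pattern, i) for i in range(last_start + 1))


def count_r(records: List[Record], pattern: str) -> int:
    """CountR (cr): total occurrences of pattern over all reads."""
    return sum(count_occurrences(rec[1], pattern) for rec in records)


def locate_r(records: List[Record], pattern: str) -> List[int]:
    """LocateR (lr): numbers of the reads that contain pattern (assumed 1-based)."""
    return [n for n, rec in enumerate(records, start=1) if pattern in rec[1]]


def extract_r(records: List[Record], read_number: int) -> str:
    """ExtractR (er): the read with the given number (assumed 1-based)."""
    if not 1 <= read_number <= len(records):
        raise IndexError("read number out of range")
    return records[read_number - 1][1]


def extract_fastq(records: List[Record], pattern: str) -> List[Record]:
    """extractFASTQ (efq): full records whose read contains pattern."""
    return [rec for rec in records if pattern in rec[1]]


# Hypothetical three-record FASTQ input.
TOY_FASTQ = """\
@r1
ACGTACGT
+
IIIIIIII
@r2
TTTTACGA
+
IIIIIIII
@r3
GGACGTCC
+
IIIIIIII
"""

records = parse_fastq(TOY_FASTQ)
print(count_r(records, "ACGT"))                           # 3
print(locate_r(records, "ACGT"))                          # [1, 3]
print(extract_r(records, 2))                              # TTTTACGA
print([rec[0] for rec in extract_fastq(records, "ACGT")])  # ['@r1', '@r3']
```

The scan takes time proportional to the total length of the reads for every query, which is the cost a compressed index is designed to avoid.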
[1] H. Huo, P. Liu, C. Wang, H. Jiang, and J. S. Vitter, CIndex: Compressed indexes for fast retrieval of FASTQ files, Bioinformatics, September 15, 2021. https://doi.org/10.1093/bioinformatics/btab655
[2] M. Burrows and D.J. Wheeler, A block-sorting lossless data compression algorithm, Tech. Report SRC-RR-124, Digital Equipment Corporation, Palo Alto, CA, 1994.
[3] R. Grossi, A. Gupta, and J. S. Vitter, High-order entropy-compressed text indexes, In Proceedings of the 14th annual ACM-SIAM symposium on Discrete Algorithms, 2003, pp. 841–850.