SeqKit - a cross-platform and ultrafast toolkit for FASTA/Q file manipulation
- Documents: http://bioinf.shenwei.me/seqkit (Usage, FAQ, Tutorial, and Benchmark)
- Source code: https://github.com/shenwei356/seqkit
Introduction
FASTA and FASTQ are basic and ubiquitous formats for storing nucleotide and protein sequences. Common manipulations of FASTA/Q files include converting, searching, filtering, deduplication, splitting, shuffling, and sampling. Existing tools implement only some of these manipulations, not particularly efficiently, and some are available only for certain operating systems. Furthermore, the complicated installation process of required packages and running environments can render these programs less user friendly.
This project describes a cross-platform ultrafast comprehensive toolkit for FASTA/Q processing. SeqKit provides executable binary files for all major operating systems, including Windows, Linux, and Mac OS X, and can be directly used without any dependencies or pre-configurations. SeqKit demonstrates competitive performance in execution time and memory usage compared to similar tools. The efficiency and usability of SeqKit enable researchers to rapidly accomplish common FASTA/Q file manipulations.
Table of Contents
- Features
- Subcommands
- Installation
- Bash-completion
- Technical details and guides for use
- Usage && Examples
- Benchmark
- Citation
- Contributors
- Acknowledgements
- Contact
- License
Features
- Cross-platform (Linux/Windows/Mac OS X/OpenBSD/FreeBSD, see download)
- Light weight and out-of-the-box, no dependencies, no compilation, no configuration (see download)
- UltraFast (see benchmark), multiple-CPUs supported
- Practical functions supported by 33 subcommands (see subcommands and usage)
- Supporting Bash-completion
- Well documented (detailed usage and benchmark)
- Seamlessly parsing both FASTA and FASTQ formats
- Supporting STDIN and gzipped input/output files, easy to use in pipes; writing gzip files is very fast (10X of gzip, 4X of pigz) by using package pgzip
- Supporting custom sequence ID regular expression (especially useful for searching with ID list)
- Reproducible results (configurable rand seed in sample and shuffle)
- Well organized source code, friendly to use and easy to extend
Features comparison
Categories | Features | seqkit | fasta_utilities | fastx_toolkit | pyfaidx | seqmagick | seqtk |
---|---|---|---|---|---|---|---|
Formats support | Multi-line FASTA | Yes | Yes | -- | Yes | Yes | Yes |
Formats support | FASTQ | Yes | Yes | Yes | -- | Yes | Yes |
Formats support | Multi-line FASTQ | Yes | Yes | -- | -- | Yes | Yes |
Formats support | Validating sequences | Yes | -- | Yes | Yes | -- | -- |
Formats support | Supporting RNA | Yes | Yes | -- | -- | Yes | Yes |
Functions | Searching by motifs | Yes | Yes | -- | -- | Yes | -- |
Functions | Sampling | Yes | -- | -- | -- | Yes | Yes |
Functions | Extracting sub-sequence | Yes | Yes | -- | Yes | Yes | Yes |
Functions | Removing duplicates | Yes | -- | -- | -- | Partly | -- |
Functions | Splitting | Yes | Yes | -- | Partly | -- | -- |
Functions | Splitting by seq | Yes | -- | Yes | Yes | -- | -- |
Functions | Shuffling | Yes | -- | -- | -- | -- | -- |
Functions | Sorting | Yes | Yes | -- | -- | Yes | -- |
Functions | Locating motifs | Yes | -- | -- | -- | -- | -- |
Functions | Common sequences | Yes | -- | -- | -- | -- | -- |
Functions | Cleaning bases | Yes | Yes | Yes | Yes | -- | -- |
Functions | Transcription | Yes | Yes | Yes | Yes | Yes | Yes |
Functions | Translation | Yes | Yes | Yes | Yes | Yes | -- |
Functions | Filtering by size | Yes | Yes | -- | Yes | Yes | -- |
Functions | Renaming header | Yes | Yes | -- | -- | Yes | Yes |
Other features | Cross-platform | Yes | Partly | Partly | Yes | Yes | Yes |
Other features | Reading STDIN | Yes | Yes | Yes | -- | Yes | Yes |
Other features | Reading gzipped file | Yes | Yes | -- | -- | Yes | Yes |
Other features | Writing gzip file | Yes | -- | -- | -- | Yes | -- |
Note 1: See version information of the software.
Note 2: See usage for detailed options of seqkit.
Subcommands
33 functional subcommands in total.
Sequence and subsequence
- seq: transform sequences (reverse, complement, extract ID...)
- subseq: get subsequences by region/gtf/bed, including flanking sequences
- sliding: sliding sequences, circular genome supported
- stats: simple statistics of FASTA/Q files
- faidx: create FASTA index file and extract subsequence
- watch: monitoring and online histograms of sequence features
- sana: sanitize broken single line fastq files
- scat: real time concatenation and streaming of fastx files
Format conversion
- fx2tab: convert FASTA/Q to tabular format (and length/GC content/GC skew)
- tab2fx: convert tabular format to FASTA/Q format
- fq2fa: convert FASTQ to FASTA
- convert: convert FASTQ quality encoding between Sanger, Solexa and Illumina
- translate: translate DNA/RNA to protein sequence (supporting ambiguous bases)
Searching
- grep: search sequences by ID/name/sequence/sequence motifs, mismatch allowed
- locate: locate subsequences/motifs, mismatch allowed
- fish: look for short sequences in larger sequences using local alignment
- amplicon: retrieve amplicon (or specific region around it) via primer(s)
BAM processing and monitoring
- bam: monitoring and online histograms of BAM record features
Set operations
- head: print first N FASTA/Q records
- range: print FASTA/Q records in a range (start:end)
- sample: sample sequences by number or proportion
- rmdup: remove duplicated sequences by id/name/sequence
- duplicate: duplicate sequences N times
- common: find common sequences of multiple files by id/name/sequence
- split: split sequences into files by id/seq region/size/parts (mainly for FASTA)
- split2: split sequences into files by size/parts (FASTA, PE/SE FASTQ)
Edit
- replace: replace name/sequence by regular expression
- rename: rename duplicated IDs
- restart: reset start position for circular genome
- concat: concatenate sequences with same ID from multiple files
- mutate: edit sequence (point mutation, insertion, deletion)
Ordering
- sort: sort sequences by id/name/sequence/length
- shuffle: shuffle sequences
Misc
- version: print version information and check for update
- genautocomplete: generate shell autocompletion script
Installation
Go to Download Page for more download options and changelogs.
SeqKit is implemented in the Go programming language;
executable binary files for most popular operating systems are freely available
on the release page.
Method 1: Download binaries (latest stable/dev version)
Just download the compressed executable file for your operating system,
and decompress it with the tar -zxvf *.tar.gz command or other tools.
And then:
- For Linux-like systems
  - If you have root privilege, simply copy it to /usr/local/bin:
    sudo cp seqkit /usr/local/bin/
  - Or copy it to anywhere in the environment variable PATH:
    mkdir -p $HOME/bin/; cp seqkit $HOME/bin/
- For Windows, just copy seqkit.exe to C:\WINDOWS\system32.
Method 2: Install via conda (latest stable version)
conda install -c bioconda seqkit
Method 3: Install via homebrew (latest stable version)
brew install brewsci/bio/seqkit
Method 4: For Go developer (latest stable/dev version)
go get -u github.com/shenwei356/seqkit/seqkit
Method 5: Docker based installation (latest stable/dev version)
git clone this repo:
git clone https://github.com/shenwei356/seqkit
Run the following commands:
cd seqkit
docker build -t shenwei356/seqkit .
docker run -it shenwei356/seqkit:latest
Bash-completion
Note: The current version supports Bash only. This should work for *nix systems with Bash installed.
Howto:
- run: seqkit genautocomplete
- create and edit the ~/.bash_completion file if you don't have it:
  nano ~/.bash_completion
  and add the following:
  for bcfile in ~/.bash_completion.d/* ; do
    . $bcfile
  done
Technical details and guides for use
FASTA/Q format parsing
SeqKit uses the author's lightweight and high-performance bioinformatics package bio for FASTA/Q parsing, which has performance close to the famous C lib klib (kseq.h).
Before v0.8.1, SeqKit called pigz (much faster than gzip) or gzip to decompress .gz files if they were available, so installing pigz improved parsing performance for gzipped data.
Since v0.8.1, SeqKit no longer calls pigz or gzip, because doing so does not always increase the speed.
You can still use pigz or gzip explicitly, e.g. pigz -d -c seqs.fq.gz | seqkit xxx.
SeqKit uses package pgzip to write gzip files,
which is very fast (10X of gzip, 4X of pigz), though the resulting gzip file is slightly larger.
Sequence formats and types
SeqKit seamlessly supports FASTA and FASTQ formats.
Sequence format is automatically detected.
All subcommands except for faidx can handle both formats.
Only FASTA format is supported when some commands (subseq, split, sort and shuffle)
use the FASTA index to improve performance for large files in two-pass mode
(by flag --two-pass).
Sequence type (DNA/RNA/Protein) is automatically detected from the leading subsequences
of the first sequences in the file or STDIN. The length of the leading subsequences
is configurable by global flag --alphabet-guess-seq-length, with a default value
of 10000. If the length of the sequences is less than that, whole sequences will
be checked.
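The idea of guessing the type from a leading subsequence can be sketched in Python as follows. This is an illustration only, not SeqKit's actual detection code: the function name guess_seq_type and the simplified alphabets (no IUPAC ambiguity codes other than N) are assumptions made for the example.

```python
# Simplified alphabets (assumption): real tools also accept ambiguity codes.
DNA_LETTERS = set("ACGTN")
RNA_LETTERS = set("ACGUN")


def guess_seq_type(seq, guess_length=10000):
    """Guess DNA/RNA/Protein from the first `guess_length` letters of `seq`."""
    if not seq:
        raise ValueError("cannot guess the type of an empty sequence")
    # Slicing past the end simply returns the whole sequence, which matches
    # "if the sequence is shorter than that, the whole sequence is checked".
    letters = set(seq[:guess_length].upper())
    if letters <= DNA_LETTERS:
        return "DNA"
    if letters <= RNA_LETTERS:
        return "RNA"
    return "Protein"
```

A shorter guess length makes the check cheaper but can misclassify a sequence whose distinguishing letters appear only later.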
Sequence ID
By default, most software, including seqkit, takes the leading non-space
letters as the sequence identifier (ID). For example,
FASTA header | ID |
---|---|
>123456 gene name | 123456 |
>longname | longname |
>gi|110645304|ref|NC_002516.2| Pseudomona | gi|110645304|ref|NC_002516.2| |
But for some sequences from NCBI,
e.g. >gi|110645304|ref|NC_002516.2| Pseudomona, the ID is NC_002516.2.
In this case, we could set the sequence ID parsing regular expression by global flag
--id-regexp "\|([^\|]+)\| " or just use flag --id-ncbi. If you want
the gi number, then use --id-regexp "^gi\|([^\|]+)\|".
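To see what these regular expressions capture, here is a small Python sketch that applies them to the example header. The two NCBI patterns are copied from the flags above; the default pattern, the helper name parse_id, and the fallback to the whole header when nothing matches are assumptions for illustration, not SeqKit's implementation.

```python
import re

DEFAULT_ID_REGEXP = r"^(\S+)"       # assumption: "leading non-space letters"
NCBI_ID_REGEXP = r"\|([^\|]+)\| "   # pattern quoted in the text
GI_ID_REGEXP = r"^gi\|([^\|]+)\|"   # pattern quoted in the text


def parse_id(header, id_regexp=DEFAULT_ID_REGEXP):
    """Return the first capture group of `id_regexp` applied to a FASTA header."""
    if header.startswith(">"):
        header = header[1:]
    match = re.search(id_regexp, header)
    # Falling back to the full header is a choice made for this sketch.
    return match.group(1) if match else header


header = ">gi|110645304|ref|NC_002516.2| Pseudomona"
print(parse_id(header))                  # gi|110645304|ref|NC_002516.2|
print(parse_id(header, NCBI_ID_REGEXP))  # NC_002516.2
print(parse_id(header, GI_ID_REGEXP))    # 110645304
```

The NCBI pattern works because it requires a space after the closing pipe, so only the last pipe-delimited field before the description can match.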
FASTA index
For some commands, including subseq, split, sort and shuffle,
when input files are (plain or gzipped) FASTA files,
the FASTA index can optionally be used for
rapid access to sequences and for reducing memory occupation.
ATTENTION: the .seqkit.fai file created by SeqKit is slightly different from the .fai file
created by samtools. SeqKit uses the full sequence head instead of just the ID as key.
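The following Python sketch shows how such an index enables random access and what keying on the full head means. It follows the samtools faidx column convention (length, offset, bases per line, bytes per line) and assumes every line of a record except the last has the same width; the function names build_index and fetch are hypothetical, and the exact on-disk layout of .seqkit.fai is not reproduced here.

```python
import io


def build_index(fasta_bytes, full_header=True):
    """Map each record key to (length, offset, line_bases, line_width)."""
    index = {}
    key = None
    offset = 0  # byte offset of the current line within the data
    for line in io.BytesIO(fasta_bytes):
        if line.startswith(b">"):
            head = line[1:].rstrip(b"\r\n").decode()
            # full_header=True keys on the whole head, False on the ID only
            key = head if full_header else (head.split() or [""])[0]
            index[key] = [0, offset + len(line), 0, 0]
        elif key is not None and line.strip():
            entry = index[key]
            bases = len(line.rstrip(b"\r\n"))
            if entry[2] == 0:  # the first sequence line sets the line geometry
                entry[2], entry[3] = bases, len(line)
            entry[0] += bases
        offset += len(line)
    return {k: tuple(v) for k, v in index.items()}


def fetch(fasta_bytes, index, key, start, end):
    """Return bases start..end (1-based, inclusive) using only the index."""
    length, offset, line_bases, line_width = index[key]
    if not 1 <= start <= end <= length:
        raise ValueError("region out of range")

    def byte_pos(i):  # byte position of the 0-based base i
        return offset + (i // line_bases) * line_width + i % line_bases

    chunk = fasta_bytes[byte_pos(start - 1):byte_pos(end - 1) + 1]
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()
```

With a real file, fetch would seek to the computed byte position instead of slicing an in-memory buffer, which is why only the requested region needs to be read.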
Parallelization of CPU intensive jobs
The validation of sequence bases and the complement process of sequences are parallelized for large sequences.
Parsing of line-based files, including BED/GFF files and ID list files, is also parallelized.
The parallelization is implemented with multiple goroutines in golang,
which are similar to but much lighter weight than threads.
The concurrency number is configurable with global flag -j or --threads
(default value: 1 for single-CPU PC, 2 for others).
Memory occupation
Most of the subcommands do not read whole FASTA/Q records into memory,
including stats, fq2fa, fx2tab, tab2fx, grep, locate, replace,
seq, sliding and subseq.
Note that when using subseq --gtf | --bed, if the GTF/BED files are too
big, the memory usage will increase.
You could use --chr to specify chromosomes and --feature to limit features.
Some subcommands, including rmdup and common, need to store sequences or heads
in memory, but there are strategies to reduce memory occupation.
When comparing by sequence, the MD5 digest can be stored in place of the sequence
by using flag -m (--md5).
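A minimal Python sketch of the digest strategy, assuming records arrive as (name, sequence) pairs; the function rmdup_by_seq is a hypothetical illustration and not SeqKit's code. Keeping a 16-byte MD5 digest per distinct sequence bounds the memory used for the "seen" set regardless of sequence length.

```python
import hashlib


def rmdup_by_seq(records, use_md5=True):
    """Yield the first record for each distinct sequence.

    records: iterable of (name, sequence) pairs.
    use_md5: store a 16-byte digest per sequence instead of the sequence itself.
    """
    seen = set()
    for name, seq in records:
        key = hashlib.md5(seq.encode()).digest() if use_md5 else seq
        if key in seen:
            continue  # duplicate sequence, skip it
        seen.add(key)
        yield name, seq
```

The trade-off is the cost of hashing every sequence and the (very small) chance that two different sequences share a digest.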
Some subcommands, including sample, split, shuffle and sort, can either read
all records into memory or read the files twice by flag -2 (--two-pass).
In two-pass mode they use the FASTA index for rapid access to sequences and for reducing memory occupation.
Reproducibility
Subcommands sample and shuffle use a random function; the random seed can be
given by flag -s (--rand-seed). This ensures that a sampling result can be
reproduced in different environments with the same random seed.
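The effect of a fixed seed can be illustrated with a short Python sketch. The function names are hypothetical, and Python's random number generator differs from Go's, so the selected records will not match SeqKit's output for the same seed; only the reproducibility principle carries over.

```python
import random


def sample_by_proportion(records, proportion, seed):
    """Keep each record with probability `proportion`, driven by a seeded RNG."""
    rng = random.Random(seed)  # private generator, unaffected by global state
    return [rec for rec in records if rng.random() < proportion]


def shuffle_records(records, seed):
    """Return a shuffled copy of `records`; the same seed gives the same order."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled
```

Calling either function twice with the same input and seed returns identical results, which is what makes a sampling step repeatable in a pipeline.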
Usage && Examples
Benchmark
More details: http://bioinf.shenwei.me/seqkit/benchmark/
Datasets:
$ seqkit stat *.fa
file          format  type   num_seqs        sum_len  min_len       avg_len      max_len
dataset_A.fa  FASTA   DNA      67,748  2,807,643,808       56      41,442.5    5,976,145
dataset_B.fa  FASTA   DNA         194  3,099,750,718      970  15,978,096.5  248,956,422
dataset_C.fq  FASTQ   DNA   9,186,045    918,604,500      100           100          100
SeqKit version: v0.3.1.1
FASTA:
FASTQ:
Citation
W Shen, S Le, Y Li*, F Hu*. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLOS ONE. doi:10.1371/journal.pone.0163962.
Contributors
- Wei Shen
- Botond Sipos for commands: bam, fish, sana, watch.
- others
Acknowledgements
We thank Lei Zhang for testing SeqKit, and also thank Jim Hester, author of fasta_utilities, for advice on early performance improvements for FASTA parsing, and Brian Bushnell, author of BBMaps, for advice on naming SeqKit and adding accuracy evaluation in benchmarks. We also thank Nicholas C. Wu from the Scripps Research Institute, USA for commenting on the manuscript and Guangchuang Yu from State Key Laboratory of Emerging Infectious Diseases, The University of Hong Kong, HK for advice on the manuscript.
We thank Li Peng for reporting many bugs.
Contact
Email me about any problem when using seqkit: shenwei356(at)gmail.com
Create an issue to report bugs, propose new functions or ask for help.