SeqKit - a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Documents: http://bioinf.shenwei.me/seqkit (Usage, FAQ, Tutorial, and Benchmark)
Source code: https://github.com/shenwei356/seqkit

Features

Easy to install (download)
- Providing statically linked executable binaries for multiple platforms (Linux/Windows/macOS, amd64/arm64)
- Light weight and out-of-the-box, no dependencies, no compilation, no configuration
- conda install -c bioconda seqkit
Easy to use
- Ultrafast (see technical-details and benchmark)
- Seamlessly parsing both FASTA and FASTQ formats
- Supporting (gzip/xz/zstd/bzip2 compressed) STDIN/STDOUT and input/output files, easily integrated into pipelines
- Reproducible results (configurable random seed in sample and shuffle)
- Supporting custom sequence ID via regular expression
- Supporting Bash/Zsh autocompletion
Versatile commands (usages and examples)
- Practical functions supported by 37 subcommands
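
The pipe, compression, custom-ID, and random-seed features listed under "Easy to use" can be sketched as follows. This is a minimal illustration rather than an official example: the file names, the records, and the ID pattern are invented for the demo, and the block skips the SeqKit calls when seqkit is not on PATH.

```shell
#!/usr/bin/env bash
# Hypothetical demo input: file names and records are invented for illustration.
workdir="$(mktemp -d)"          # clean up afterwards with: rm -rf "$workdir"
demo_fa="$workdir/demo.fa"
printf '>s1|geneA first record\nACGTACGTAC\n>s2|geneB second record\nGGGGCCCCAA\n>s3|geneC third record\nTTTTAAAACC\n' > "$demo_fa"
gzip -c "$demo_fa" > "$demo_fa.gz"

if command -v seqkit >/dev/null 2>&1; then
    # Compressed files are read directly, and commands chain through STDIN/STDOUT.
    seqkit seq -m 10 "$demo_fa.gz" | seqkit stats

    # Custom sequence ID: capture the text before the first "|" as the ID,
    # then print IDs only (-n: names, -i: ID part only).
    seqkit seq -n -i --id-regexp '^([^|]+)\|' "$demo_fa"

    # Reproducible sampling: the same seed (-s) should select the same records.
    seqkit sample -p 0.5 -s 11 "$demo_fa.gz" > "$workdir/run1.fa"
    seqkit sample -p 0.5 -s 11 "$demo_fa.gz" > "$workdir/run2.fa"
    cmp -s "$workdir/run1.fa" "$workdir/run2.fa" && echo "samples are identical"
else
    echo "seqkit not found on PATH; skipping the SeqKit calls" >&2
fi
```

The seed value 11 is an arbitrary choice here; any fixed integer serves the same purpose.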

Installation

Go to the Download Page for more download options and changelogs, or install via conda:

conda install -c bioconda seqkit
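
After installation, a quick check that the binary is reachable might look like the sketch below. The `seqkit_status` variable is only an illustrative name, not something SeqKit itself defines.

```shell
# Confirm that seqkit is on PATH and report the installed version.
if command -v seqkit >/dev/null 2>&1; then
    seqkit_status="found"
    seqkit version
else
    seqkit_status="missing"
    echo "seqkit is not on PATH; check the conda environment or the download location" >&2
fi
echo "seqkit: $seqkit_status"
```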

Subcommands

category	command	function	input	strand-sensitivity	multi-threads	popularity
basic	seq	transform sequences: extract ID/seq, filter by length/quality, remove gaps, reverse complement…	FASTA/Q			★★★★★
	stats	simple statistics: #seqs, min/max_len, N50, Q20%, Q30%…	FASTA/Q		✓	★★★★★
	sum	compute message digest for all sequences in FASTA/Q files	FASTA/Q	+ or both	✓
	subseq	extract subsequences or flanking sequences by region/gtf/bed	FASTA/Q	+ or/and -		★★★
	sliding	extract subsequences in sliding windows	FASTA/Q	+ only		★★
	faidx	create FASTA index file and extract subsequence (with more features than samtools faidx)	FASTA	+ or/and -
	watch	monitoring and online histograms of sequence features	FASTA/Q
	sana	sanitize broken single line FASTQ files	FASTQ
	scat	real time concatenation and streaming of fastx files	FASTA/Q		✓
format conversion	fq2fa	convert FASTQ to FASTA	FASTQ			★★
	fa2fq	retrieve corresponding FASTQ records by a FASTA file	FASTA/Q
	fx2tab	convert FASTA/Q to tabular format	FASTA/Q			★★
	tab2fx	convert tabular format to FASTA/Q format	FASTA/Q
	convert	convert FASTQ quality encoding between Sanger, Solexa and Illumina	FASTA/Q
	translate	translate DNA/RNA to protein sequence	FASTA/Q	+ or/and -		★★
searching	grep	search sequences by ID/name/sequence/sequence motifs, mismatch allowed	FASTA/Q	+ and -	partly, -m	★★★★★
	locate	locate subsequences/motifs, mismatch allowed	FASTA/Q	+ and -	partly, -m	★★★★★
	amplicon	extract amplicon (or specific region around it), mismatch allowed	FASTA/Q	+ and -	partly, -m	★
	fish	look for short sequences in larger sequences	FASTA/Q	+ and -
set operation	sample	sample sequences by number or proportion	FASTA/Q			★★★★
	rmdup	remove duplicated sequences by ID/name/sequence	FASTA/Q	+ and -		★★★
	common	find common sequences of multiple files by id/name/sequence	FASTA/Q	+ and -
	duplicate	duplicate sequences N times	FASTA/Q			★
	split	split sequences into files by id/seq region/size/parts (mainly for FASTA)	FASTA preferred			★
	split2	split sequences into files by size/parts (FASTA, PE/SE FASTQ)	FASTA/Q			★★
	head	print first N FASTA/Q records	FASTA/Q
	head-genome	print sequences of the first genome with common prefixes in name	FASTA/Q
	range	print FASTA/Q records in a range (start:end)	FASTA/Q
	pair	match up paired-end reads from two FASTQ files	FASTA/Q
edit	concat	concatenate sequences with the same ID from multiple files	FASTA/Q	+ only		★★★
	replace	replace name/sequence by regular expression	FASTA/Q	+ only		★★
	restart	reset start position for circular genome	FASTA/Q	+ only		★
	mutate	edit sequence (point mutation, insertion, deletion)	FASTA/Q	+ only
	rename	rename duplicated IDs	FASTA/Q			★
ordering	sort	sort sequences by id/name/sequence/length	FASTA preferred			★★
	shuffle	shuffle sequences	FASTA preferred
BAM processing	bam	monitoring and online histograms of BAM record features	BAM

Notes:

Strand-sensitivity:
- + only: only processing on the positive/forward strand.
- + and -: searching on both strands.
- + or/and -: depends on users' flags/options/arguments.
Multi-threads: The default 4 threads are fast enough for most commands; some commands can benefit from extra threads.
Popularity: Based on statistics of 227 publications citing SeqKit since 2020.
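
The strand-sensitivity and thread notes can be illustrated with a small sketch. The input file and its two records are hypothetical: one carries the motif GATTACA on the forward strand, the other carries only its reverse complement (TGTAATC). The block skips the SeqKit calls when seqkit is not on PATH.

```shell
#!/usr/bin/env bash
# Hypothetical two-record input, invented for illustration.
workdir="$(mktemp -d)"          # clean up afterwards with: rm -rf "$workdir"
strand_fa="$workdir/strand.fa"
printf '>fwd\nCCGATTACACC\n>rev\nCCTGTAATCCC\n' > "$strand_fa"

if command -v seqkit >/dev/null 2>&1; then
    # grep is listed as "+ and -": by default both strands are searched,
    # so both records are candidates for a match.
    seqkit grep -s -p GATTACA "$strand_fa" | grep -c '^>'

    # -P (--only-positive-strand) restricts the search to the forward strand.
    seqkit grep -s -P -p GATTACA "$strand_fa" | grep -c '^>'

    # -j (--threads) overrides the default of 4 threads for commands that
    # benefit from more; 8 is an arbitrary value for this demo.
    seqkit stats -j 8 "$strand_fa"
else
    echo "seqkit not found on PATH; skipping the SeqKit calls" >&2
fi
```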

Citation

W Shen, S Le, Y Li*, F Hu*. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLOS ONE. doi:10.1371/journal.pone.0163962.

Contributors

Wei Shen
Botond Sipos: bam, scat, fish, sana, watch.
others

Acknowledgements

We thank Lei Zhang for testing SeqKit, and also thank Jim Hester, author of fasta_utilities, for advice on early performance improvements of FASTA parsing and Brian Bushnell, author of BBMaps, for advice on naming SeqKit and adding accuracy evaluation in benchmarks. We also thank Nicholas C. Wu from the Scripps Research Institute, USA for commenting on the manuscript and Guangchuang Yu from State Key Laboratory of Emerging Infectious Diseases, The University of Hong Kong, HK for advice on the manuscript.

We thank Li Peng for reporting many bugs.

We appreciate Klaus Post for his fantastic packages (compress and pgzip) which accelerate gzip file reading and writing.
