seqhasher is a command-line tool designed to calculate a hash (digest or fingerprint) for each sequence in a FASTA file and add it to the sequence header.
seqhasher [--options] input_file [output_file]
Parameters:
--name: An optional parameter that replaces the input file name in the header of the output with the specified text.
--nofilename: Optional. Disables adding a file name to the sequence header.
--headersonly: Optional. Outputs only sequence headers.
--hashtype: Optional. The hash type: sha1 (default), md5, xxhash, cityhash, murmur3.
input_file: Specifies the path to the input FASTA file or '-' to use standard input (stdin).
output_file: Specifies the path to the output file or '-' to use standard output (stdout). This parameter is optional; if not provided, the output will be directed to stdout by default.
The tool can read the input either from a specified file or from standard input (stdin), and similarly, it can write the output to a specified file or to standard output (stdout).
The --name option lets you customize the header of the output by specifying text that replaces the input file name.
The --hashtype option lets you specify which hash function to use.
Currently, the following hash functions are supported:
- SHA1 (default), 160-bit hash value
- MD5, 128-bit hash value
- xxHash (extremely fast), 64-bit hash value
- CityHash (e.g., used in VSEARCH), 128-bit hash value
- Murmur3, 128-bit hash value (Sourmash also uses Murmur3, but in a 64-bit variant)
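To make the digest sizes above concrete, here is a minimal sketch that hashes a single sequence string with GNU coreutils. The `hash_seq` helper is hypothetical and not part of seqhasher; it covers only SHA1 and MD5, since coreutils ships no xxHash, CityHash, or Murmur3 tools. A 160-bit SHA1 digest prints as 40 hex characters and a 128-bit MD5 digest as 32. Whether seqhasher normalizes case or whitespace before hashing is not stated here, so its digests may differ from these.

```shell
# Hypothetical helper (not part of seqhasher): print the hex digest of one
# sequence string. Assumes GNU coreutils (sha1sum, md5sum) are installed.
hash_seq() {
    sequence="$1"
    algo="${2:-sha1}"          # default mirrors the tool's default hash type
    case "$algo" in
        sha1) printf '%s' "$sequence" | sha1sum | cut -d ' ' -f 1 ;;
        md5)  printf '%s' "$sequence" | md5sum  | cut -d ' ' -f 1 ;;
        *)    echo "unsupported hash type: $algo" >&2; return 1 ;;
    esac
}
```

For example, `hash_seq ACGT` prints a 40-character SHA1 digest and `hash_seq ACGT md5` a 32-character MD5 digest. Identical sequences always yield identical digests, which is what makes the hash usable as a sequence fingerprint.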
To process a FASTA file and output to another file:
seqhasher input.fasta output.fasta
To process a FASTA file from standard input and output to standard output, while replacing the file name in the header with 'Sample':
cat input.fasta | seqhasher --name 'Sample' - - > output.fasta
# OR
seqhasher --name 'Sample' - - < input.fasta > output.fasta
To evaluate the performance of two solutions for processing DNA sequences, we used hyperfine to compare an AWK-based solution against the seqhasher binary.
First, let's create the test data: a FASTA file containing 500,000 sequences, each 30 to 3000 nucleotides long.
awk -v numSeq=500000 'BEGIN{
  srand();
  for(i=1; i<=numSeq; i++){
    # Random sequence length between 30 and 3000
    seqLen=int(rand()*(2971))+30;
    printf(">seq_%d\n", i);
    for(j=1; j<=seqLen; j++){
      # Pick each nucleotide with equal probability
      r=rand();
      if(r < 0.25) nucleotide="A";
      else if(r < 0.5) nucleotide="C";
      else if(r < 0.75) nucleotide="G";
      else nucleotide="T";
      printf("%s", nucleotide);
    }
    printf("\n");
  }
}' > big.fasta
The size of the file is ~760MB.
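The size follows from the generator: lengths are uniform between 30 and 3000, so the mean is about 1515 nucleotides, and 500,000 × 1515 ≈ 757.5 MB of sequence data plus a few MB of headers. As a quick sanity check before benchmarking, a small helper like the one below can confirm the sequence count and length range. `fasta_stats` is a hypothetical name, and the sketch assumes one sequence per line, as produced by the generator above.

```shell
# Hypothetical helper: report the number of sequences and the shortest and
# longest sequence length in a FASTA file with single-line sequences.
fasta_stats() {
    awk '
        /^>/ { n++; next }                         # header line: count the record
        {
            len = length($0)
            if (min == "" || len < min) min = len  # track the shortest sequence
            if (len > max) max = len               # track the longest sequence
        }
        END { printf "sequences=%d min=%d max=%d\n", n, min, max }
    ' "$1"
}
```

Running `fasta_stats big.fasta` should report 500000 sequences, with a minimum no lower than 30 and a maximum no higher than 3000.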