gccontent-benchmark

Benchmarking different languages for a simple bioinformatics task (Counting the GC fraction of DNA in a FASTA file)

Comparing string processing performance of programming languages

... using a simple bioinformatics task: Computing the GC fraction of DNA. It is based on the GC content problem at Rosalind.

Usage

make
cat report.md

If you have pandoc installed, you can also create an HTML report:

make html-report
<browser> report.html

More info

This is a continuation of a previous benchmarking project, covered in this blog post.

The idea is to compare the string processing performance of different programming languages by implementing a very small and very simple task: read a specific file containing a DNA sequence in the FASTA format, and compute the GC content of that file.

Two requirements apply:

  1. The file must be read line by line (DNA files are in reality often bigger than RAM, and this also helps make the implementations remotely comparable).
  2. For each line, the program has to check whether it starts with a > character; if so, it is a header row and should be skipped.

The FASTA file can contain DNA letters (A, C, G, T), unknowns (N), and newlines (Unix-style \n).
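The two requirements above can be sketched in a few lines of Python. This is a minimal illustration, not one of the benchmarked implementations: the function name `gc_fraction` is hypothetical, and it assumes that the GC fraction is computed over A, C, G and T only, so that unknowns (N) are left out of the denominator.

```python
import io
from typing import Iterable


def gc_fraction(lines: Iterable[str]) -> float:
    """Return the GC fraction of the sequence lines in a FASTA stream.

    Assumption: only A, C, G and T are counted; N and newlines are ignored.
    """
    gc = 0
    at = 0
    for line in lines:
        # Requirement 2: a line starting with ">" is a header row, skip it.
        if line.startswith(">"):
            continue
        gc += line.count("G") + line.count("C")
        at += line.count("A") + line.count("T")
    total = gc + at
    # Avoid dividing by zero when the input has no counted bases.
    return gc / total if total else 0.0


# Small in-memory example: 6 of the 11 counted bases are G or C.
sample = io.StringIO(">seq1\nGATTACA\nNNNN\n>seq2\nGGCC\n")
print(gc_fraction(sample))
```

Iterating over an open file object (`with open(path) as fh: gc_fraction(fh)`, where `path` is a placeholder) yields one line at a time, which satisfies requirement 1 without loading the whole file into memory.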

That is it. Please have a look at the Makefile and the various implementations in the code directories, or send a pull request with your own implementation. If the language already exists, increase the number by one step: for a new Go implementation, you would create a golang.001 folder, optionally with a tag appended to it, such as golang.001.table-optimized.

Results

These are some results (execution times in seconds, smaller is better) from running some of the tests in the Makefile on a Dell Inspiron laptop with an Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz, running Xubuntu 18.04 Bionic LTS 64-bit.

(More details about BIOS settings etc. follow below the tables.)

| Language and implementation | Execution time (s) | Compiler or interpreter version |
| --- | --- | --- |
| rust.003.vectorized | 0.442 | rustc 1.52.0-nightly (152f66092 2021-02-17) |
| rust.004.simd | 0.445 | rustc 1.52.0-nightly (152f66092 2021-02-17) |
| rust.002.bitshift | 0.695 | rustc 1.52.0-nightly (152f66092 2021-02-17) |
| rust.001 | 0.891 | rustc 1.52.0-nightly (152f66092 2021-02-17) |
| c.001 | 0.970 | gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 |
| cpp.001 | 1.025 | g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 |
| d | 1.215 | LDC - the LLVM D compiler (1.22.0): based on DMD v2.092.1 |
| c | 1.226 | gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 |
| go.001.unroll | 1.616 | go version go1.15 linux/amd64 |
| nim.003.zerocopy | 1.660 | Nim Compiler Version 1.2.6 [Linux: amd64] |
| nim.002 | 1.703 | Nim Compiler Version 1.2.6 [Linux: amd64] |
| julia | 1.926 | julia version 1.5.3 |
| go | 1.937 | go version go1.15 linux/amd64 |
| c.003.ril | 1.955 | gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 |
| nim.001 | 2.281 | Nim Compiler Version 1.2.6 [Linux: amd64] |
| crystal.002.peek | 2.369 | Crystal 0.36.1 [c3a3c1823] (2021-02-02) LLVM: 10.0.0 |
| pypy | 2.677 | Python 2.7.13 (5.10.0+dfsg-3build2, Feb 06 2018, 18:37:50) [PyPy 5.10.0 with GCC 7.3.0] |
| cpp | 2.832 | g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 |
| nim | 2.976 | Nim Compiler Version 1.2.6 [Linux: amd64] |
| rust | 3.195 | rustc 1.52.0-nightly (152f66092 2021-02-17) |
| crystal | 4.054 | Crystal 0.36.1 [c3a3c1823] (2021-02-02) LLVM: 10.0.0 |
| ada | 4.235 | GNAT Community 2020 (20200818-93) |
| java | 4.279 | openjdk version "11.0.10" 2021-01-19 OpenJDK Runtime Environment GraalVM CE 21.0.0.2 (build 11.0.10+8-jvmci-21.0-b06) |
| crystal.001.csp | 4.283 | Crystal 0.36.1 [c3a3c1823] (2021-02-02) LLVM: 10.0.0 |
| java | 4.284 | openjdk version "11.0.10" 2021-01-19 OpenJDK Runtime Environment GraalVM CE 21.0.0.2 (build 11.0.10+8-jvmci-21.0-b06) |
| cython | 6.016 | Cython version 0.26.1 |
| fpc | 6.578 | Free Pascal Compiler version 3.0.4+dfsg-18ubuntu2 [2018/08/29] for x86_64 |
| node | 6.836 | v15.9.0 |
| perl | 7.323 | This is perl 5, version 26, subversion 1 (v5.26.1) built for x86_64-linux-gnu-thread-multi |
| python | 8.855 | Python 3.7.0 |
| graalvm | 11.734 | GraalVM Version 21.0.0.2 (Java Version 11.0.10+8-jvmci-21.0-b06) |

Results with relaxed constraints on reading line-by-line

The contributed versions below depart slightly from reading line by line (by some definition of that requirement, which is clearly very hard to define):

| Language and implementation | Execution time (s) | Compiler version |
| --- | --- | --- |
| rust.007.rawio | 0.221 | rustc 1.52.0-nightly (152f66092 2021-02-17) |
| rust.005.rawio | 0.318 | rustc 1.52.0-nightly (152f66092 2021-02-17) |
| C.002.rawio | 0.524 | gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 |
| rust.006.rawio | 0.539 | rustc 1.52.0-nightly (152f66092 2021-02-17) |
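To illustrate what departing from line-by-line reading can look like, here is a rough Python sketch that reads fixed-size blocks of raw bytes and tracks header lines across block boundaries. It is an assumption about the general approach, not a port of any of the rawio implementations listed here; the function name `gc_fraction_chunked` and the default block size are hypothetical, and N is again assumed to be excluded from the denominator.

```python
import io
from typing import BinaryIO


def gc_fraction_chunked(stream: BinaryIO, chunk_size: int = 1 << 16) -> float:
    """Compute the GC fraction by reading raw blocks instead of lines."""
    gc = 0
    at = 0
    in_header = False      # True while inside a ">" header line
    at_line_start = True   # True if the next byte begins a new line
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pos = 0
        while pos < len(chunk):
            if at_line_start and chunk[pos:pos + 1] == b">":
                in_header = True
            newline = chunk.find(b"\n", pos)
            end = len(chunk) if newline == -1 else newline
            if not in_header:
                segment = chunk[pos:end]
                gc += segment.count(b"G") + segment.count(b"C")
                at += segment.count(b"A") + segment.count(b"T")
            if newline == -1:
                # The current line continues in the next block.
                at_line_start = False
            else:
                at_line_start = True
                in_header = False
            pos = end + 1
    total = gc + at
    return gc / total if total else 0.0


data = b">seq1\nGATTACA\nNNNN\n>seq2\nGGCC\n"
# A tiny block size forces headers and sequence lines to span blocks.
print(gc_fraction_chunked(io.BytesIO(data), chunk_size=3))
```

The file is still never held in memory as a whole, which is why it is debatable whether such a version violates the line-by-line requirement.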

More details about settings used when benchmarking

The following CPU options were turned off in BIOS, to try to avoid fluctuating CPU clock frequencies:

  • Performance > Intel SpeedStep
  • Performance > C-States Control
  • Performance > Intel TurboBoost
  • Power Management > Intel Speed Shift Technology

Benchmarking was done with other GUI apps, networking, and Bluetooth turned off.

Incomplete list of contributions before merge to GitHub

For contributors after the GitHub repo was established, see this page here on GitHub.

In addition, below is an incomplete list of people who contributed to the code examples while the benchmark was hosted only on my old blog:

  • Daniel Spångberg (working at UPPMAX HPC center at the time) contributed numerous, extremely fast implementations in C, including the one above (c), which is constrained by the requirement to process the file line by line.
  • Roger Peppe (twitter) contributed the fastest Go implementation, which uses pointers in combination with a table lookup.
  • Mario Ray Mahardhika (aka leledumbo) contributed the fastest FreePascal implementation, which is the one above (fpc.000).
  • Harald Achitz provided the C++ implementation used above (cpp.000).
  • (Who is missing here?)