Algorithms

In the Korf Lab, our programming goal is to write simple, beautiful code. When one focuses on speed, efficiency, or cleverness, code becomes difficult to maintain, extend, and debug.

This repo is a collection of demonstration programs and pipelines that are designed to be simple and beautiful. Think of these as templates for your own projects. Don't extend this source code; it is purposefully too simple and too self-contained. Write something better.

Checklist

Providing programs/pipelines that work is only part of our goal. In order to be "beautiful", every project must be distributable, reproducible, and educational to others. Maybe no project is ever truly complete, but a project can be considered ready to release when it passes these criteria (which are explained more fully below).

  • GitHub
  • Open source
  • README.md
  • TUTORIAL.md
  • API documentation
  • Unix-standard CLI
  • Test data
  • Unit/functional tests

Small Tools Manifesto

It is worth reading the Small Tools Manifesto, which is basically a philosophy that bioinformatics programs should work like other Linux CLI programs: small, interoperable, open source, documented, tested, etc.

Languages

  • Python - default
  • Go - high performance
  • Snakemake - pipelines
  • R - some statistical analyses
  • Perl - legacy, hacks
  • C - legacy, high performance
  • Shell - automation

Python is our default language. It is popular, broadly useful, and quick to develop in, so always start here.

Go is our language of choice for high performance needs. It's a good idea to develop your application in Python before porting it to Go, as each implementation helps debug the other.

Snakemake is becoming the community standard for pipelines, so that's what we use.

R is used for some statistical analyses. Although we don't write R very often, there are a variety of very useful programs written in R.

Perl is no longer used for large-scale projects, but remains useful for quick-n-dirty hacks.

C was our previous language for high performance needs. New projects should not be started in C, but there are some legacy projects that may still be usefully extended.

Bash and other shell languages should be used only to automate very simple tasks. If you find yourself writing loops and conditionals in bash, consider doing that in Python or Perl instead.

What about other languages like C++, C#, Java, Javascript, Julia, Kotlin, Lua, PHP, Raku, Ruby, Rust, Swift, etc? It's a good idea to educate yourself in other languages, but please don't write lab software in anything other than Python, Go, Snakemake, R, Perl, or C.

Markdown

All documentation should be in Markdown format. Make sure that the document looks good in plain text by using 80-column line breaks and padding tables. See the plain text version of this document as an example.

Conda

Generally, we use Conda to manage our environments. However, there is a big difference when it comes to building software vs. running pipelines.

When programming, don't specify the exact components of the tool chain. For example, do not pin a particular version of Python3 in your Conda environment. If some future version of Python3 breaks our software, we want our automated testing to flag that. Where possible, we would rather fix our software than tell everyone to install outdated components.

For pipelines, we are not in control of other peoples' software. For this reason, we have to specify our environment more precisely.

Documentation

Following standard practices, all repos should have a README.md. You should include the following in your document.

  • Intent of the software
  • Specific build requirements if any
  • External data files if required
  • Testing procedure (see below)
  • Authors who directly contributed to the project
  • Attributions to those who contributed indirectly to the project
  • Any other information that might be useful

Most projects should also include a TUTORIAL.md that walks a user through using the various programs and/or libraries in the project.

Libraries should have API documentation. See STYLE.md, which describes language-specific guidelines for documentation (among other things).

Testing

Every project should include test data and automated testing procedures. The testing harness should generally follow community standards.

  • Libraries should have unit tests
  • Programs should have functional tests
  • Pipelines should have functional tests
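
As a sketch of what a functional test might look like, a program can be driven end-to-end with Python's built-in subprocess and unittest modules. The invocation and expected output below are assumptions for illustration, not the repo's actual interface:

```python
# Hypothetical functional test for a randomseq-like program. The invocation
# (./randomseq 5 100) and the output checks are assumptions for illustration.
import subprocess
import unittest

class TestRandomseq(unittest.TestCase):
    def test_sequence_count_and_length(self):
        result = subprocess.run(
            ['python3', './randomseq', '5', '100'],
            capture_output=True, text=True, check=True)
        lines = result.stdout.splitlines()
        headers = [line for line in lines if line.startswith('>')]
        residues = sum(len(line) for line in lines if not line.startswith('>'))
        self.assertEqual(len(headers), 5)    # 5 sequences requested
        self.assertEqual(residues, 500)      # 5 sequences x 100 bp

if __name__ == '__main__':
    unittest.main()
```

A pipeline's functional test works the same way: run it on the bundled test data and compare the output to a known-good result.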

Programs

All programs should have command line arguments that follow the Unix standard. This includes a usage statement that is reported when supplied with -h or --help. The usage statement should also be reported when the command line is malformed (e.g. no input where a positional argument is required).
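
In Python, argparse provides this behavior almost for free. A minimal sketch (the program and option names are illustrative, not the repo's actual interfaces):

```python
import argparse

# argparse adds -h/--help automatically and prints a usage statement
# when the command line is malformed (e.g. a missing positional argument)
parser = argparse.ArgumentParser(description='illustrative CLI skeleton')
parser.add_argument('fasta', help='input FASTA file (positional, required)')
parser.add_argument('--seed', type=int, default=None, help='random seed')
args = parser.parse_args()

print(f'reading {args.fasta} with seed {args.seed}')
```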

Each program in this repo is written in multiple languages to give a feel for how to program in each language. Generally, we seek to follow the community style guides; however, there is always some room for personal style. When coding, there are difficult choices between simple vs. abstract, simple vs. efficient, simple vs. sophisticated, and simple vs. language-specific. In this lab, we lean towards the simple.

  • randomseq - generate random FASTA files of DNA sequences
  • kmerfreq - read sequences, output a table of kmer frequencies
  • dust - read sequences, output sequences with low complexity regions masked
  • longestorf - read sequences, output the longest open reading frame of each
  • smithwaterman - read sequences, produce local alignments

randomseq

Generate random DNA sequences of fixed length. The composition of the sequences defaults to 25% for each nucleotide, but the program should accept other mononucleotide distributions.

Inputs

  • Number of sequences to generate
  • Length of each sequence
  • Probability of each letter
  • Prefix for sequence identifiers
  • Random seed

Outputs

  • Multi-FASTA format to STDOUT
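
A minimal sketch of the core idea, assuming the inputs above map onto function parameters (names and defaults here are illustrative):

```python
import random

def random_fasta(num, length, probs=(0.25, 0.25, 0.25, 0.25),
                 prefix='seq', seed=None):
    """Yield FASTA records of random DNA with the given base composition."""
    rng = random.Random(seed)                    # seeded for reproducibility
    for i in range(num):
        seq = ''.join(rng.choices('ACGT', weights=probs, k=length))
        yield f'>{prefix}{i+1}\n{seq}'

if __name__ == '__main__':
    for record in random_fasta(3, 60, seed=1):   # 3 sequences of 60 bp
        print(record)
```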

kmerfreq

Determine the k-mer frequencies in a FASTA file. The value of k should be an argument with a default (e.g. 3). Output formats should include tab-separated (TSV) and JSON.

Inputs

  • Multi-FASTA file (gzipped or STDIN)
  • K-mer size
  • Flag for JSON output (defaults to tabular)

Outputs

Either TSV or JSON to STDOUT.
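
A sketch of the counting step, assuming sequences have already been read from FASTA (the function name and defaults are illustrative):

```python
import json
from collections import Counter

def kmer_freqs(seqs, k=3):
    """Count k-mer frequencies across a collection of DNA sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i+k]] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in sorted(counts.items())}

if __name__ == '__main__':
    freqs = kmer_freqs(['ACGTACGT', 'TTTTACGT'], k=3)
    as_json = False                              # a real flag would control this
    if as_json:
        print(json.dumps(freqs, indent=4))       # JSON output
    else:
        for kmer, f in freqs.items():            # TSV output (the default)
            print(f'{kmer}\t{f:.4f}')
```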

dust

Mask sequences with an entropy filter. The window size and entropy threshold should have default values and command line options. There should be an option to change the output from N-based (hard) masking to lowercase (soft) masking.

Inputs

  • Multi-FASTA file (gzipped or STDIN)
  • Window size
  • Entropy threshold
  • Flag for lowercase masking (defaults to N)

Outputs

  • Multi-FASTA file to STDOUT
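
A sketch of the entropy filter on a single sequence (the window size, threshold, and function name are illustrative, not the real program's defaults):

```python
import math
from collections import Counter

def dust(seq, window=20, threshold=1.5, soft=False):
    """Mask low-complexity regions using sliding-window Shannon entropy."""
    seq = seq.upper()
    masked = [False] * len(seq)
    for i in range(len(seq) - window + 1):
        counts = Counter(seq[i:i+window])
        h = -sum(n/window * math.log2(n/window) for n in counts.values())
        if h < threshold:                        # low-complexity window
            for j in range(i, i + window):
                masked[j] = True
    return ''.join(
        (b.lower() if soft else 'N') if m else b for b, m in zip(seq, masked))

if __name__ == '__main__':
    print(dust('ACGTACGTAAAAAAAAAAAAAAAAAAAAGTACGTACGT', window=10, threshold=1.0))
```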

longestorf

Translate each sequence and provide the protein sequence with the longest open reading frame. There should be an option to do 3- or 6-frame translations.

Inputs

  • Multi-FASTA file (gzipped or STDIN)
  • Flag for double-stranded translation (defaults to positive strand)

Outputs

  • Multi-FASTA file to STDOUT
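
A sketch of the core logic on a single uppercase sequence, using the standard genetic code (the defaults and the handling of the double-stranded flag are assumptions):

```python
# Standard genetic code, bases in TCAG order (first base varies slowest)
BASES = 'TCAG'
AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON = {a + b + c: AMINO[16*i + 4*j + k]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)
         for k, c in enumerate(BASES)}

def translate(seq):
    """Translate a nucleotide sequence in frame 0 ('X' for unknown codons)."""
    return ''.join(CODON.get(seq[i:i+3], 'X') for i in range(0, len(seq) - 2, 3))

def longest_orf(seq, both_strands=False):
    """Return the protein of the longest ORF (first ATG to stop), 3 or 6 frames."""
    def revcomp(s):
        return s.translate(str.maketrans('ACGT', 'TGCA'))[::-1]
    strands = [seq] + ([revcomp(seq)] if both_strands else [])
    best = ''
    for strand in strands:
        for frame in range(3):
            for fragment in translate(strand[frame:]).split('*'):
                m = fragment.find('M')           # ORF starts at first methionine
                if m != -1 and len(fragment) - m > len(best):
                    best = fragment[m:]
    return best

if __name__ == '__main__':
    print(longest_orf('CCATGGCTGCTTAAGGG'))      # MAA
```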

smithwaterman

Classic local alignment algorithm using match, mismatch, and gap scores. There should be a query sequence and database, both in FASTA format. Output formats should include tabular (score, coordinates) and human readable (alignments).

Inputs

  • Query sequence in FASTA format (gzipped ok)
  • Database sequence in Multi-FASTA format (gzipped or STDIN)
  • Match score
  • Mismatch score
  • Gap score
  • Flag for tabular output

Outputs

  • Tabular format
  • Alignment format
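
A score-only sketch of the dynamic programming fill (the default scores and coordinate convention are assumptions; the real program also does a traceback to produce the alignment output):

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best local alignment score and its 1-based end coordinates."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, end = 0, (0, 0)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            if H[i][j] > best:
                best, end = H[i][j], (i, j)
    return best, end

if __name__ == '__main__':
    print(smith_waterman('ACGTTACG', 'TTACGGG'))  # (5, (8, 5))
```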

Benchmarks

| Lang | Prog  | Time1 | Time2 | Notes                            |
|------|-------|-------|-------|----------------------------------|
| .c   | rseq  |  0.10 |       |                                  |
| .go  | rseq  |  0.11 |       |                                  |
|      | kfreq |  1.18 |       |                                  |
|      | dust  |  0.33 |       |                                  |
|      | lorf  |  0.39 |       |                                  |
|      | sw    |  0.05 |       | currently does not support STDIN |
| .pl  | rseq  |  1.05 |  0.46 |                                  |
|      | kfreq |  7.11 |  2.97 |                                  |
|      | dust  |  7.21 |  3.14 |                                  |
|      | lorf  |  2.75 |  1.14 |                                  |
|      | sw    |  5.69 |  2.37 |                                  |
| .py  | rseq  |  0.99 |  0.43 |                                  |
|      | kfreq |  5.49 |  2.23 |                                  |
|      | sdust |  6.70 |       | slower version of dust           |
|      | dust  |  3.11 |       |                                  |
|      | lorf  |  3.52 |  1.81 |                                  |
|      | sw    |  0.98 |       |                                  |

Times are in seconds, recorded by running each command three times and taking the best real time. Time1 corresponds to Ian's Lenovo IdeaPad3 running LinuxLite in a VirtualBox with 2 cores and 4G RAM. Time2 corresponds to Lightning.

At some point this should be automated; a sketch of such a script appears after the command lists below.

Command lines for each program

The randomseq command generates a random sequence file approximately the size of a bacterial genome (4,000 genes, 1,000 bp per gene). This is used for subsequent tests.

  • Go
    • time ./randomseq -num 4000 -len 1000 > foo
    • time ./kmerfreq -k 10 -in foo > /dev/null
    • time ./dust -in foo > /dev/null
    • time ./longestorf -r -in foo > /dev/null
    • head -80 foo > db (create the small database first)
    • time ./smithwaterman -query data/testseq.fa -db db > /dev/null
  • Perl
    • time ./randomseq 4000 1000 > foo
    • time ./kmerfreq -k 10 foo > /dev/null
    • time ./dust foo > /dev/null
    • time ./longestorf -6 foo > /dev/null
    • time head -80 foo | ./smithwaterman data/testseq.fa - > /dev/null
  • Python
    • time ./randomseq 4000 1000 > foo
    • time ./kmerfreq -k 10 foo > /dev/null
    • time ./sdust foo > /dev/null
    • time ./dust foo > /dev/null
    • time ./longestorf -r foo > /dev/null
    • time head -80 foo | ./smithwaterman data/testseq.fa - > /dev/null
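
A possible starting point for automating the benchmarks, sketched in Python; the command strings are taken from the lists above and assume the programs are built and run from the repo directory:

```python
# Hypothetical benchmark driver: run each command three times and report the
# best wall-clock time, matching the procedure described above.
import subprocess
import time

COMMANDS = {
    'py rseq':  './randomseq 4000 1000 > foo',
    'py kfreq': './kmerfreq -k 10 foo > /dev/null',
    'py dust':  './dust foo > /dev/null',
}

def best_of_three(command):
    times = []
    for _ in range(3):
        start = time.perf_counter()
        subprocess.run(command, shell=True, check=True)
        times.append(time.perf_counter() - start)
    return min(times)

if __name__ == '__main__':
    for name, command in COMMANDS.items():
        print(f'{name}\t{best_of_three(command):.2f}')
```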

Pipelines

Unfinished section

  • Something with Snakemake designed for local use
  • Something with Snakemake designed for cluster use
  • A simple bash script?
  • Something with make?