Code and examples for the JHU Computational Genomics class. Please feel free to submit issues.
The notebooks subdirectory contains the raw JSON for the IPython notebooks referenced in class. The notebooks are also available as public GitHub gists, and you can view readable versions of them at the following URLs:
- Strings, exact and approximate matching, sorting
- Suffix indexes
- Pairwise sequence alignment
- Sequence assembly
- Sequence classification
These notebooks are for teaching purposes. They are certainly not meant to be efficient.
The following notebooks describe and explore some common file formats used to store genomics data. They also include Python code for parsing (and sometimes indexing) the formats.
If you are taking my class and you have any trouble accessing these resources, please contact me and I can help. All of these articles should be easily accessible from the JHU campus or via VPN / library proxy.
- Class readings (see syllabus for where these fit in)
  - Life and its Molecules by Lawrence Hunter
  - A decade's perspective on DNA sequencing technology by Elaine Mardis
  - Sequencing technologies -- the next generation by Michael Metzker
  - The DNA Data Deluge by Mike Schatz and Ben Langmead
  - Suffix arrays: a new method for on-line string searches by Udi Manber and Gene Myers
  - Introduction to the Burrows-Wheeler Transform and FM Index by Langmead
  - Assembly of large genomes using second-generation sequencing by Mike Schatz et al
  - How to apply de Bruijn graphs to genome assembly by Phillip Compeau et al
  - Computational prediction of eukaryotic protein-coding genes by Michael Zhang
- Further reading
  - Replacing suffix trees with enhanced suffix arrays by Mohamed Abouelhoda et al
  - A Block-sorting Lossless Data Compression Algorithm by Michael Burrows and David Wheeler (describes Burrows-Wheeler Transform)
  - Opportunistic data structures with applications by Paolo Ferragina and Giovanni Manzini (describes FM Index)
  - Ultrafast and memory-efficient alignment of short DNA sequences to the human genome by Langmead et al (describes Bowtie)
  - Fast and accurate short read alignment with Burrows–Wheeler transform by Heng Li and Richard Durbin (describes BWA)
- Videos
  - Animation of DNA wrapping and replication
  - Animation of transcription and translation
  - PBS Documentary "DNA" (getting old, but still very good)
    - Part 1 of 5: The Secret of Life
    - Part 3 of 5: The Human Race
  - Video describing how Illumina's sequencing-by-synthesis technology works
  - Animation of how one "3rd-generation" sequencer works
  - Many cool animations of biological phenomena by John Kyrk
  - Demo of pairwise sequence alignment
  - Next-Generation Sequencing Technologies, presentation by Elaine Mardis at NHGRI in 2012
  - Presentation describing 1st, 2nd and 3rd generation sequencing (with campy music)
    - Note: Helicos is defunct, and Roche 454 and Life Tech SOLiD technologies are not very popular anymore
  - Videos on the basics of Git and GitHub
- Notebooks
  - Traveling Salesman Problem by Peter Norvig
  - Write a Genome Assembler by Jason Chin
- Textbooks and lecture notes for other classes
  - Algorithms by Vazirani et al
    - Check out the first two chapters if you need some analysis review, and check out the chapter on dynamic programming.
- Recorded lectures for this class
  - Boyer-Moore
  - Higher-order HMM
  - Spliced alignment
  - Minimum path cover on DAG to recover isoforms
  - Pair HMMs
  - Profile HMMs