"Follow the data!" was the dictum my advisor, Hans Frauenfelder gave me more than 30 years ago as I was thinking about what to do post-PhD. At the time Hans said that, biology wasn't especially data-rich, but now the situation has changed. For example, the Webb Space Telescope is expected to produce around 200 TB of data per year. A premier biological sequence observatory, the Broad Institute, has been producing that has been producing that much data per month for the last few years.
Moreover, the number of labs with sequencers in them is a lot larger---and growing faster--than the number of labs with telescopes. It's not just sequencers, either; there are sizeable data flows from protein crystallography at synchrotron beamlines worldwide, and there's about to be huge streams coming out of microscopy-driven projects such as the Human Brain Mapping Initiative. Biology has quietly become the most data-intensive science of the Age of Big Data.
Throughout my career I've worked on motions in proteins. I've returned to this work lately with a new perspective brought on by AI-driven structural methods such as AlphaFold.
Here's an essay on creating the Theory of Biology by building bridges among data, signatures, models, and applications.
Much of my recent work has been on genomic sequences. Here are some thoughts on bioinformatics and bioinformaticians.
I create software to analyze data, and I try to write for the future as well as to solve particular problems today. If I do my part well, my efforts serve as a model and an example, not just a means to an end. Here are my thoughts on writing scalable software.
I have published roughly 50 papers with over 13,000 citations that explore the interrelations among sequences, structures, gene-family trees, dynamics, and hydration.
Here is an overview of some of my code repositories and other places where I've contributed:
- azulejo combines guilt-by-profiling (genome synteny) and guilt-by-association (phylogeny) to create pangenomic collections of gene families and tile phylogenetic space with supertrees of proxy genes. It uses out-of-memory external merges and a novel "peatmer" algorithm to achieve linear scaling with a small memory footprint, so it can accommodate large numbers of input genomes.
- click_loguru: Most scientific code needs a CLI and benefits from logging to a file. This repository combines those two needs and is the starting point for most of my active codes.
- pytest-datadir-mgr: Most scientific code needs to test using data files too large to be kept in the repository. The code in this repository makes downloading input data and saving intermediate results an easier task.
- pybio Gentoo Overlay: Computational biologists need a development distro, and for years mine has been Gentoo because of the large number of biology-related packages and because it's a source-code distribution. (For production and container use, I like Clear Linux for its performance and update properties.) This private repo contains another 100 or so packages that I find useful on top of the 200 in the main tree and the 300 in the Science overlay.
- aakbar: This Amino Acid K-mer calculator can be used to calculate signature peptides by phylogenetic or other means. Its output can be used in Sequedex or other signature methods. Its input can be raw proteomes, but it's better with sets of proxy genes from azulejo.
- alphabetsoup: Parallel data wrangling of input sequences, including alphabet checking and removal of some ugly but common artifacts.
- Sequedex is R&D100 award-winning software that uses scalable signature methods to classify short DNA sequences as to where they come from and what they do. Sequedex is mostly used in metagenomics and surveillance for emergent infectious diseases. The Sequedex open-source repository is here.
- SOLVE is R&D100 award-winning software that helps automate the problem of phasing X-ray crystal structures of proteins. It calculates a statistic that acts very much like autofocus on a camera. SOLVE is closed-source.
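
To make the amino-acid k-mer idea mentioned under aakbar a bit more concrete, here is a minimal Python sketch. It is not aakbar's actual code or algorithm; it only illustrates one simple notion of a "signature": a k-mer that occurs in exactly one of the input proteomes. The function names (`kmers`, `signature_kmers`) and the toy labels and sequences are hypothetical, made up for this illustration.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Yield every overlapping k-mer in a protein sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def signature_kmers(proteomes, k):
    """Return, per label, the k-mers that occur in that proteome only.

    proteomes: dict mapping a label to a list of protein sequences.
    This is a toy definition of "signature", assumed for illustration.
    """
    # Record which proteomes each k-mer has been seen in.
    owners = defaultdict(set)
    for label, sequences in proteomes.items():
        for seq in sequences:
            for kmer in kmers(seq, k):
                owners[kmer].add(label)

    # Keep only k-mers seen in exactly one proteome.
    signatures = defaultdict(set)
    for kmer, labels in owners.items():
        if len(labels) == 1:
            signatures[next(iter(labels))].add(kmer)
    return dict(signatures)


# Toy input: two made-up "proteomes" with very short sequences.
toy = {
    "genome_a": ["MKVLAT", "MKVQ"],
    "genome_b": ["MKVLGG"],
}
result = signature_kmers(toy, 3)
```

In this toy input, `MKV` and `KVL` occur in both proteomes and so are dropped, while the remaining 3-mers are kept as signatures of whichever proteome contains them. A real tool would also have to deal with ambiguous residues, low-complexity regions, and memory use at scale, none of which this sketch attempts.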
You can comment on this page or reach me on Twitter.