Source code and related materials for Genomics in the Cloud, an O'Reilly book by Geraldine A. Van der Auwera and Brian D. O'Connor.
You can find the book in the O'Reilly Learning Library at https://oreil.ly/genomics-cloud, on Amazon (Kindle or paperback), and in both ebook and print formats from a variety of other booksellers. We do encourage you to get it through your local independent bookstore if you’re able.
Data in the genomics field is booming. In just a few years, organizations such as the National Institutes of Health (NIH) will host 50+ petabytes—or 50 million gigabytes—of genomic data, and they’re turning to cloud infrastructure to make that data available to the research community. How do you adapt analysis tools and protocols to access and analyze that data in the cloud?
With this practical book, researchers will learn how to work with genomics algorithms using open source tools including the Genome Analysis Toolkit (GATK), Docker, WDL, and Terra. Geraldine Van der Auwera, longtime custodian of the GATK user community, and Brian O’Connor of the UC Santa Cruz Genomics Institute guide you through the process. You’ll learn by working with real data and genomics algorithms from the field.
This book takes you through:
- Essential genomics and computing technology background
- Basic cloud computing operations
- Getting started with GATK
- Three major GATK Best Practices pipelines for variant discovery
- Automating analysis with scripted workflows using WDL and Cromwell
- Scaling up workflow execution in the cloud, including parallelization and cost optimization
- Interactive analysis in the cloud using Jupyter notebooks
- Secure collaboration and computational reproducibility using Terra
For more information about the book and why you might find it useful, please see the Genomics in the Cloud blog.
See the commands folder for text files that let you easily copy and paste the commands from the hands-on exercises.
For those of you reading the print version of the book, which does not include color figures, we've made the figures available in full color in the figures directory of the GCS bucket.
You may use all figures except 3-3 and 6-15 in your own non-commercial work, preferably with a notice of attribution referring to the book. For commercial use, please contact permissions@oreilly.com. Figures 3-3 and 6-15 do not belong to us, so you must request permission from their respective owners, which are noted in the book.
We also put together a companion booklet that contains the figures and their captions for more convenient browsing or printing. It's "semi-official" in the sense that we created and maintain it, but it is not published by O'Reilly, so it does not go through their quality control process. Think of it as an artisanal, locally sourced side dish.
We have a blog for the book at https://broadinstitute.github.io/genomics-in-the-cloud/ where we cover various topics including additional tutorials, errata for the book, and regular updates on new features that you may be interested in. Feel free to suggest blog topics by reaching out to us on Twitter or LinkedIn (see contact info below).
If you encounter errors or broken links in the book, please file an issue on O'Reilly's Errata page. Anything reported there that we can verify will get fixed and updated in both the electronic versions and subsequent printing runs of the book, so others won't run into the same problems.
To avoid confusion and redundancy with the O'Reilly Errata page, we don't use GitHub Issues for this project.
If you run into problems while working through the hands-on exercises, or if you have follow-up questions about the topics we discuss in the book, please post your questions in either the GATK forum or the Terra forum. The frontline support team will most likely be able to address your questions, and for anything else they will loop us into the conversation if you mention that your question is related to our book. If you're not sure which forum to use, just flip a coin; it's the same team that maintains both communities.
Remember also that you can often save yourself some time by searching the GATK documentation or Terra documentation before posting a question -- that way you don't have to wait for someone to get back to you.
If you'd like to get in touch, you can reach us on Twitter (@VdAGeraldine and @boconnor) and on LinkedIn (Geraldine and Brian). We look forward to hearing what you think of the book! If you like it, please consider posting a review on Amazon.