Analysis of Genetic Data, Part 1

Research Computing Center, University of Chicago
November 2, 2016
2:00 pm - 4:00 pm
Instructor: Peter Carbonetto
Helper: Will Graybeal

General Information

In this 2-hour workshop, participants will apply simple approaches to investigate and visualize large-scale genetic data sets, with an emphasis on practical skills that can be applied to genetics research. This is intended to be an informal, hands-on workshop, and no background in genetics is required; anyone with intermediate computing skills (see "Prerequisites") who is curious about human genetics and the "genomics revolution" is encouraged to register. Over the course of the 2 hours, we will draw insights directly from "raw" genetic data, and participants can continue to explore the data independently using the techniques introduced in class.

Level: Intermediate

Prerequisites: This workshop assumes some experience performing simple tasks in a UNIX-like shell environment, as well as basic familiarity with R. Participants must be able to log in to the RCC compute cluster, although experience using the RCC cluster is not required. All participants must bring a laptop running a Mac, Linux, or Windows operating system on which they have administrative privileges.

Where: Kathleen A. Zar Room, John Crerar Library, University of Chicago (OpenStreetMap).

Additional info: This workshop is an attempt to apply elements of the Software Carpentry approach (see also this article) to interactive instruction for computing/quantitative sciences. Some of the materials contained within are adapted from a Stanford workshop given in March 2016. For a more in-depth exploration of the concepts and techniques introduced, see John Novembre's PopGen workshop.

Please also take a look at the Code of Conduct and the Software License, which applies to all the scripts and code examples in this repository. All instructional material contained in this repository is made available under the Creative Commons Attribution license (CC BY 4.0).

Aims

Explore the application of numeric techniques for investigating genetic diversity and population structure from large-scale genotype data.
Understand how large genetic data sets are commonly represented in computer files.
Use command-line tools to manipulate genetic data, and use R to summarize and visualize the results of a genetic data analysis.
Practice using the RCC shell environment (midway) for large-scale computation.

Episodes

Episode	Concepts
1. Setup	How do I set up my shell environment on midway for an analysis of genetic data?
2. Principal component analysis of genetic data	How do I encode genetic polymorphism data? How do I represent genetic polymorphism data as a matrix? How can I visualize the results of PCA to gain insight into the structure of genetic data?
3. Making predictions using PCA	How do I ensure a consistent encoding of the genotype data? How do I map another genetic data set onto an existing PCA result? What does this mapping tell us (and not tell us) about a sample's ancestral origins?
4. ADMIXTURE analysis of genetic data	How do I visualize and interpret the results of running ADMIXTURE on genetic data? How do I use the ADMIXTURE results to make predictions for new samples?
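
As a rough preview of the questions in Episode 2, the sketch below shows one common way to encode genotypes as a numeric matrix (the count of non-reference alleles per SNP: 0, 1, or 2) and to project samples onto the first principal component. It is not the workshop's own code: the sample names, genotypes, reference alleles, and function names are all invented for illustration, and it is written in dependency-free Python with a simple power iteration rather than the tools used in class.

```python
# Toy data: one row per sample, one two-letter genotype per SNP.
# Sample names, genotypes and reference alleles are all invented.
REF_ALLELES = ["A", "C", "T", "G"]
GENOTYPES = {
    "s1": ["AA", "CC", "TT", "GG"],
    "s2": ["AG", "CC", "TT", "GG"],
    "s3": ["AA", "CT", "TT", "GA"],
    "s4": ["GG", "TT", "CC", "AA"],
    "s5": ["GG", "CT", "CC", "AA"],
    "s6": ["AG", "TT", "CT", "AA"],
}


def encode(genotype, ref):
    """Count the alleles that differ from the reference: 0, 1 or 2."""
    return sum(1 for allele in genotype if allele != ref)


def genotype_matrix(genotypes, ref_alleles):
    """Build a samples-by-SNPs matrix of allele counts (list of rows)."""
    return [[encode(g, r) for g, r in zip(row, ref_alleles)]
            for row in genotypes.values()]


def center_columns(X):
    """Subtract each SNP's mean count so every column has mean zero."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]


def first_pc_scores(X, iters=100):
    """Project samples onto the leading principal axis (power iteration)."""
    Xc = center_columns(X)
    p = len(Xc[0])
    v = [1.0 / (j + 1) for j in range(p)]  # arbitrary nonzero start vector
    for _ in range(iters):
        # One step of power iteration on Xc^T Xc: v <- Xc^T (Xc v), normalized.
        proj = [sum(x * w for x, w in zip(row, v)) for row in Xc]
        v = [sum(s * row[j] for s, row in zip(proj, Xc)) for j in range(p)]
        norm = sum(w * w for w in v) ** 0.5
        v = [w / norm for w in v]
    return [sum(x * w for x, w in zip(row, v)) for row in Xc]


X = genotype_matrix(GENOTYPES, REF_ALLELES)
scores = first_pc_scores(X)
for name, score in zip(GENOTYPES, scores):
    print(f"{name}: PC1 = {score:+.2f}")
```

In this toy example, samples s1 to s3 and samples s4 to s6 land on opposite sides of zero along the first principal component, which is the kind of structure a PCA plot is meant to reveal. The sign of a principal component is arbitrary in general, so only the separation is meaningful.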

Extras

Preparation of the 1000 Genomes genotype data

