This is the Coursera course offered by JHU for Genomic Data Science

Python in Genomic Data Science

How to sort a dirtionary in Python

It is not possible to sort a dictionary, only to get a representation of a dictionary that is sorted. Dictionaries are inherently orderless, but other types, such as lists and tuples, are not. So you need an ordered data type to represent sorted values, which will be a list—probably a list of tuples.

For instance,

import operator  
x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}  
sorted_x = sorted(x.items(), key=operator.itemgetter(1))

sorted_x will be a list of tuples sorted by the second element in each tuple. dict(sorted_x) == x.

And for those wishing to sort on keys instead of values:

import operator
x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
sorted_x = sorted(x.items(), key=operator.itemgetter(0))

Algorithms: Read Alignment Problem

Naive matching

def naive(p, t):
    occurrences = []
    for i in range(len(t) - len(p) + 1):  # loop over alignments
        match = True
        for j in range(len(p)):  # loop over characters
            if t[i+j] != p[j]:  # compare characters
                match = False
                break
        if match:
            occurrences.append(i)  # all chars matched; record
    return occurrences

Lets say y is the length of reference genome, x is the length of the reads:

The maximum number of comparison would be x(y-x+1) times
The minimum number of comparison would be y-x+1 times
This algorithm tends to have number of comparisons closer to minimum

Boyer-Moore

Matching from left-to-right of the pattern and skip the maximum of either:

Bad character rule: Upon mismatch, skip alignments until:
a) mismatch becomes a match, or
b) p moves past mismatched character
Good suffix rule: Let t = substring matched by inner loop; skip until:
a) there are no mismatches between p and t or
b) p moves past t
naive matching (i.e. 1)

Preprocessing

For a given pattern p and a given reference text T, pre-processing p, we can use the same pre-processed result for different T, samewise, we can use the same pre-processed result derived from T to test many different given p
if an algorithm pre-processes T, then it is an Offline algorithm
Otherwise, it is an Online algorithm, Boyer-Moore is an Online algorithm

kmer indexing

k-mer: substring of length k
Indexing DNA: Constructing an index which contain every possible k-mer substring in DNA T with its left-most offset associated.
Quering index: Quering the any k-mer substring in pattern p, if we find one matching, we call it a hit which indicates there may be a matching around the region containing the substring.
Verification: Chech the rest bases of the pattern matching or not.
Note kmer indexing is an offline algorithm (preprocessing T).

Distance

Hamming distance

For X & Y where |X| = |Y|,
hamming distance = minimum # substitutions needed to turn one into the other.

Edit distance

For X & Y,
edit distance = minimum # edits (substitutions, insertions, deletions) needed to turn one into the other

If len(X) != len(Y), the minimum of Edit distance is ||X| - |Y||, i.e. editDistance(X,Y) >= ||X| - |Y||

Dynamic Programming for implementing Edit distance

Given a substring A and a character which in together comprise the string Aa, similarly, Bb is another string.
edist(Aa,Bb) = min(edist(A,Bb)+1, edist(Aa,B)+1, edist(A,B)+delta(a,b))
where delta = 1 if a != b, delta = 0 if a = b

def editDistance(x, y):
	D = []
	# initialize the matrix D with size (len(x)+1) * (len(y)+1)
	for i in range(len(x)+1):
		D.append([0]* (len(y)+1))

	# the distance between empty string and a string L is always len(L)
	for i in range(len(x)+1):
		D[i][0] = i
	for i in range(len(y)+1):
		D[0][i] = i

	for i in range(1, len(x)+1):
		for j in range(1, len(y)+1):
			distHor = D[i][j-1] + 1
			distVer = D[i-1][j] + 1
			if x[i-1] == y[j-1]:
				distDiag = D[i-1][j-1]
			else:
				distDiag = D[i-1][j-1] + 1

			D[i][j] = min(distHor, disVer, distDiag)
	return D[-1][-1]

Approximate matching in editDistance

For a [p X T] matrix:
we can find the minimal edit distance between p and a substring of T by
Taking the minimum value in the bottom and then track back along with decreasing value.
A vertical move in the traceback corresponds to a insertion in p with respect to T.
A horizental move in tracback corresponds to a deletion in p with respect to T.

Global alignment --- Penalty matrix

Edit distance weighs all differences equally, where as global alignment can give different weights to different mutations

Transitions(A<->G or C<->T) are more likely to happen than transversions(others cases)
Substitutions are more likely to happen than indels(deletions or insertions)

Local alignment --- scoring matrix

Global alignment is concerned with the overall similarity between two strings X and Y, whereas local alignment is concerned with similarities between substrings of X and Y

negative score for mismathes
positive score for matches

Algorithm: Assembly Problem

First law of assembly: If a suffix of read A is similar to a prefix of read B, then A and B might overlap in the genome.
Second law of assembly: More coverage leads to more and longer overlaps. Coverage = total length of all reads / length of the genome
Third law of assembly: Reapeats makes Assembly Problem hard to solve since it is very hard to determine how many repeats are there.

def overlap(a, b, min_length = 3):
	start = 0

	while True:
		start = a.find(b[:min_length], start)
		if start == -1:
			return 0
		if b.startswith(a[start:]):
			return len(a)-start
		start += 1

from itertools import permutations

def naive_overlap_map(reads, k):
	olaps = {}
	for a,b in permutations(reads, 2):
		olen = overlap(a,b,min_length=k)
		if olen > 0:
			olaps[(a,b)] = olen
	return olaps

Shortest Common Superstring

Give a set of string S, the SCS is the shortest string containing the strings in S as substrings

Directed Graph (Overlap Graph)

Each node represent a substring Each arrow represent how many overlaps are between two strings(Directed: suffix of the previous overlap the prefix of the later)

Eulerian Walks

De Brujin Graphs --- multigraph

For every adjacent pair, adding an directed arrow.
To use De Brujin Graph in Assembly problem, we:

First find all k_mers
Then find left k-1_mer and right k-1_mer in each k-mer
Lastly add directed edge to two k-1_mers

def de_bruijn_ize(st, k):
	edges = []
	node = set()
	for i in range(len(st) - k + 1):
		edges.append((st[i:i+k-1]), st[i+1:i+k])
		nodes.add(st[i:i+k-1])
		nodes.add(st[i+1:i+k])
	return nodes, edges

Some properties of the graph:

Each k-mer has an edge
One node per distinct k-1_mer

To reconstruct the sequence of the assemble, just walk through the nodes along the directed edges exactly once. It is known as Eulerian Walks

Problems in Real Assembly tools

Sequencing errors --- Deadend in graphs or disconnected graphs
Contain edges that can be inferred by other edges (transitional inferable) --- Occurs in overlap graph
Polyploidy --- Bubbles in the graphs
Repeats: Main problem, ususally assembles the unambiguous subsequences only.

Possible Improvement

Increase the number of k in k-mers in order to contain the repeated patterns with unique patterns surrounded. (pair-end sequence)

CommandLine tools for Genomic Data Science

Unix Basic

more/less: read file content
diff/comm: compare file content
gzip/gunzip bzip2/bunzip2 tar -cvf/tar -xvf: zip/uzip and achive files
cut/sort/grep/wc: quering file content
cat/vim: modify file content

Sequences Representation Format

Genome annotation = determine the precise location and structure (intervals, or lists of intervals, and associated biological information) of genomic features along the genome

BED (Browser Extensible Data) format (0-based coordinates)

Basic format

chr | start | end ---- 3 columns

Extended format:

0-based coordinates: A|C|A|G|C|T|A
0 1 2 3 4 5 6 7
1-based coordinates: 1 2 3 4 5 6 7

GTF (Genomic Transfer Format)

each interval feature taks one line
Column 9 can have additional attributes
Columns 1-9 separated by tab '\t'; fields within column 9 seperated by space ' '
Coordinates are 1-based

GFF (Genomic Feature Format)

SAM/BEM (Representation for alignments)

Header

@HD: header: VN:version SO:sorted or not
@SQ: sequences: SN:chr_id LN:length
@PG: program: ID:program_name VN:version CL:commanline used for parameter setting

Alignments:

Read id
Flag
Chr
Start
Mapping quality
CIGAR(alignment)
Mate chr (=: same chr, * : not mapped, or the name of the chr which the mate located)
Mate start
Mate dist
Query seq
Query base quals
Alignment score (eg. AS:i:0)
Edit distance to reference (e.g. NM:i:0)
Number of hits (e.g. NH:i:10)
Strand (e.g. XS:A:-)
Hit index for this alignment (HI:i:0)

Flag

multiple segments (mates)
each segment properly aligned
segment unmapped
next segment unmapped
SEQ is reverse complemented in the alignment
SEQ of next segment is reverse complemented
first segment (mate)
last segment (mate)
secondary alignment
not passing quality checks
PCR or optical duplicate
supplementary alignment

Note: CONVER TO BINARY DIGITS THEN READ FROM RIGHT TO LEFT, e.g. 0000 0110 0011 => 1100 0110 0000

Tags: A = character, i = integer, f = float, Z = string, H = hex string

CIGAR

M: match (sequence match or substitution)
I: insertion to the reference
D: deletion from the reference
N: skipped region (intro)
S: soft clipping (sequence start or end not aligned -> seq appears in SEQ)
H: hard clipping (seq not in SEQ)
P: padding first segment
=: sequence match
X: sequence mismatch

mpileup format

chr | pos | ref | nrds(depth) | bases | quals | (optional)readpos ---- 6-7 columns

***bases

- : match, forward
, : mismatch, reverse
T : mismatch (to T), forward
t : mismatch (to T), reverse
^ : beginning of read
$ : end of read
+[0-9]+[ACGTNacgtn]+ insertion in the reference(e.g. +3ACC)
-[0-9]+[ACGTNacgtn]+ deletion in the reference(e.g. -2GG)
> : reference skip

VCF/BCF (Variant Code Format)

chr | pos | ID | ref | alt | qual | filter | info | format | NA00001

INFO

NS: number of samples with data
DP: total depth
AF: allele frequency
AA: ancestral allele
DB: dbSNP membership, build 129
H2: HapMap2 membership

FORMAT

GT: genotype
GQ: genotype quality
DP: read depth
HQ: haplotype quality

e.g. Ref:g c a G g t
Var:g c a A g t
chr | pos | ID | ref | alt | qual | filter | info
14 | 4 | . | G | A | . | PASS | DP=100

Ref: g c a G g t
Var: g c a - g t
chr | pos | ID | ref | alt | qual | filter | info
14 | 3 | . | AG | A | . | PASS | DP=100

Ref: g c a G g t
Var1:g c a - g t
Var2:g c a A g t
Var3:g c a Gtg t
chr | pos | ID | ref | alt | qual | filter | info
14 | 3 | . | AG | A,AA,AGT | . | PASS | DP=100

Commandline tools

Excute command in background: nohup [command] &

SAMTOOLS

view: samtools view -bT (fa file) (sam file) > (bam file)
sort
index
merge
mpileup

BEDTOOLS

intersect (-wo,-wao,-wa)
bedtobam/bamtobed

Bioconductor

GRanges

A GRanges is a data structure for storing ggenomic intervals
Functionality in GenomicRanges and IRanges, they are fast and efficient
Every R user dealing with genomic data needs to master GRanges

Statistics

Exploratory analysis

Nomrmalization and preprocessing

Dimension Reduction

Sigular value decomposition: U x D x V^T

U -- left sigular vector/eigenarrays: column of U describe patterns across columns (arrays)
D -- sigular value:eigenvector: Sigular value decoposition -- explain the variation
percentage of variance explained of ith D value = D_i^2/sum(D^2)
V^T -- right sigular vector/eigengenes: columne of V^T describe patterns across rows (genes)

## remove the effect of row means which is the primary variance, we want to know the variance between samples and genes.
edata_centered = edata - rowMeans(edata)
svd1 = svd(edata_centered)
plot(svd1$d,ylab = "Singular values",col = 2)
plot(svd1$d^2/sum(svd1$d^2),ylab = "Percent Variance Explained", col =2)

par(mfrow = c(1,2))
plot(svd1$v[,1],col = 2, ylab = "1st PC")
plot(svd1$v[,2],col = 2, ylab = "2nd PC")

plot(svd1$v[,1],svd1$v[,2],ylab = "2nd PC",xlab = "1st PC",col = as.numeric(pdata$study))

boxplot(svd1$v[,1] ~ pdata$study,border = c(1,2))
points(svd1$v[,1] ~ jitter(as.numeric(pdata$study)), col = as.numeric(pdata$study))

pc1 = prcomp(edata)
edata_centered2 = t(t(edata)-colMeans(edata))
svd2 = svd(edata_centered2)
plot(pc1$rotation[,1],svd2$v[,1],col = 2)

Quantile Normalization

Use to reomve techinical variability or batch effect within groups. Do NOT use when the variation is due to biological effect. Fig.1 schematic of quantile normalization

library(preprocessCore)
colramp = colorRampPalette(c(3,"white",2))(20)
plot(density(edata[,1]),col = colramp[1],lwd = 3,ylim = c(0,.30))
for(i in 2:20)
{
	lines(density(edata[,i]),lwd = 3, col = colramp[i])
}

norm_edata = normalize.quantiles(as.matrix(edata))
plot(density(norm_edata[,1]),col = colramp[1],lwd = 3,ylim = c(0,.30))
for(i in 2:20)
{
	lines(density(norm_edata[,i]),lwd = 3, col = colramp[i])
}

svd1 = svd(norm_edata - rowMeans(norm_edata))
plot(svd1$v[,1],svd1$v[,2], col = as.numeric(pdata$study))

Statistical modeling

Linear Model

library(Biobase)
library(broom)

edata = as.matrix(edata)

lm1 = lm(edata[1,] ~ pdata$age)

plot(pdata$age, edata[1,],col = 1)
abline(lm1,col =2 ,lwd = 3)

lm2 = lm(edata[1,] ~ pdata$gender)

lm3 = lm(edata[1,]~pdata$age + pdata$gender)

lm4 = lm(edata[1,]~pdata$age * pdata$gender)

Many Regression Model

library(Biobase)
library(limma)
library(edge)

mod = model.matrix( ~ pdata$strain)
fit = lm.fit(mod, t(edata))

hist(fit$coefficients[1,],braks = 100,col = 2, xlab = "intercept")

edge_study = build_study(data = edata, grp = pdata$strain, adj.var = as.factor(pdata$lane.number))
fit_edge = fit_models(edge_study)
summary(fit_edge)

Fair95/GenomicDataSci-JHU