Using list of sequences vs. PWM to generate plot produces different bit heights

Question

Using list of sequences vs. PWM to generate plot produces different bit heights

sd46w opened this issue 5 years ago · 2 comments

Hi! When I use a list of sequences to generate the plot, the bit heights are slightly smaller than when I use the position weight matrix. Do you know why this might be?

I have included the code I wrote to generate the PWM and the plots, the list of sequences and the PWM (exported to excel docs), and the plots below.

Thank you!

#Generate position weight matrix
Background_P3_Blunt = matrix(ncol=20, nrow = nrow(P3_Blunt))
colnames(Background_P3_Blunt) = c("1","2","3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20")
print(Background_P3_Blunt)

Background_Sequences_P3_Blunt = unlist(P3_Blunt$Complement_Target_Sequence)
Background_Sequences_P3_Blunt = as.character(Background_Sequences_P3_Blunt)

for (i in 1:nrow(Background_P3_Blunt))
{Background_P3_Blunt[i,]= strsplit(Background_Sequences_P3_Blunt, "")[[i]]}

Background_P3_Blunt1 = matrix(ncol=20, nrow = 4)
colnames(Background_P3_Blunt1) = c("1","2","3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20")
rownames(Background_P3_Blunt1)=c("A","U","G","C")

Count_P3_Blunt=nrow(Background_P3_Blunt)

for (i in 1:20)
{
Background_P3_Blunt1[1,i]= (length(which(Background_P3_Blunt[,i] == "A")))/ Count_P3_Blunt
Background_P3_Blunt1[2,i]= (length(which(Background_P3_Blunt[,i] == "U")))/ Count_P3_Blunt
Background_P3_Blunt1[3,i]= (length(which(Background_P3_Blunt[,i] == "G")))/ Count_P3_Blunt
Background_P3_Blunt1[4,i]= (length(which(Background_P3_Blunt[,i] == "C")))/ Count_P3_Blunt
}

Generate seqlogo plot

Using list of sequences

Count_P3_Blunt1 = length(Background_Sequences_P3_Blunt)
jpeg("Background_P3_Blunt_Sequences.jpg")
P3_Blunt_Background_seqlogo = ggseqlogo(Background_Sequences_P3_Blunt)
P3_Blunt_Background_seqlogo + annotate("text", x=10, y=0.09, label=paste("n =", Count_P3_Blunt1, "siRNA sequences tested"), size = 8) + annotate("text", x=10, y=0.1, label=paste("P3 Blunt Background"), size =10)
dev.off()

Using position weight matrix

jpeg("Background_P3_Blunt_PWM")
P3_Blunt_Background_seqlogo1 = ggseqlogo(Background_P3_Blunt1)
P3_Blunt_Background_seqlogo1
P3_Blunt_Background_seqlogo1 + annotate("text", x=10, y=0.09, label=paste("n =", Count_P3_Blunt, "siRNA sequences tested"), size = 8) + annotate("text", x=10, y=0.1, label=paste("P3 Blunt Background"), size =10)
dev.off()

Here is the position weight matrix and the plot it generates

Background_P3_Blunt1.xlsx

Here is the list of sequences and the plot it generates

Background_Sequences_P3_Blunt.xlsx

Answer 1 · 2020-06-01T23:14:02.000Z

Hey Sarah, I replied to your email but perhaps it didn't go through.

The reason they are different is that ggseqlogo scales each position by the information content. This information is encoded in the matrix generated by ggseqlogo. If you provide a raw relative frequency matrix as you did, it does not have any knowledge of information content and thus does not scale the columns accordingly.

See https://rdrr.io/cran/ggseqlogo/src/R/heights.r for how ggseqlogo generates the matrix (under makePFM)
See https://en.wikipedia.org/wiki/Sequence_logo under logo creation for more info.

I would recommend you feed the sequences directly into ggseqlogo and let it handle this. The matrix functionality was created for cases where the input sequence was not available.

Answer 2 · 2020-06-01T23:26:55.000Z

Hi! Thank you so much - I didn't receive your email... weird! This makes sense. Thanks again.