Using list of sequences vs. PWM to generate plot produces different bit heights
sd46w opened this issue · 2 comments
Hi! When I use a list of sequences to generate the plot, the bit heights are slightly smaller than when I use the position weight matrix. Do you know why this might be?
I have included the code I wrote to generate the PWM and the plots, the list of sequences and the PWM (exported to excel docs), and the plots below.
Thank you!
#Generate position weight matrix
Background_P3_Blunt = matrix(ncol=20, nrow = nrow(P3_Blunt))
colnames(Background_P3_Blunt) = c("1","2","3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20")
print(Background_P3_Blunt)
Background_Sequences_P3_Blunt = unlist(P3_Blunt$Complement_Target_Sequence)
Background_Sequences_P3_Blunt = as.character(Background_Sequences_P3_Blunt)
for (i in 1:nrow(Background_P3_Blunt))
{Background_P3_Blunt[i,]= strsplit(Background_Sequences_P3_Blunt, "")[[i]]}
Background_P3_Blunt1 = matrix(ncol=20, nrow = 4)
colnames(Background_P3_Blunt1) = c("1","2","3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20")
rownames(Background_P3_Blunt1)=c("A","U","G","C")
Count_P3_Blunt=nrow(Background_P3_Blunt)
for (i in 1:20)
{
Background_P3_Blunt1[1,i]= (length(which(Background_P3_Blunt[,i] == "A")))/ Count_P3_Blunt
Background_P3_Blunt1[2,i]= (length(which(Background_P3_Blunt[,i] == "U")))/ Count_P3_Blunt
Background_P3_Blunt1[3,i]= (length(which(Background_P3_Blunt[,i] == "G")))/ Count_P3_Blunt
Background_P3_Blunt1[4,i]= (length(which(Background_P3_Blunt[,i] == "C")))/ Count_P3_Blunt
}
Generate seqlogo plot
Using list of sequences
Count_P3_Blunt1 = length(Background_Sequences_P3_Blunt)
jpeg("Background_P3_Blunt_Sequences.jpg")
P3_Blunt_Background_seqlogo = ggseqlogo(Background_Sequences_P3_Blunt)
P3_Blunt_Background_seqlogo + annotate("text", x=10, y=0.09, label=paste("n =", Count_P3_Blunt1, "siRNA sequences tested"), size = 8) + annotate("text", x=10, y=0.1, label=paste("P3 Blunt Background"), size =10)
dev.off()
Using position weight matrix
jpeg("Background_P3_Blunt_PWM")
P3_Blunt_Background_seqlogo1 = ggseqlogo(Background_P3_Blunt1)
P3_Blunt_Background_seqlogo1
P3_Blunt_Background_seqlogo1 + annotate("text", x=10, y=0.09, label=paste("n =", Count_P3_Blunt, "siRNA sequences tested"), size = 8) + annotate("text", x=10, y=0.1, label=paste("P3 Blunt Background"), size =10)
dev.off()
Here is the position weight matrix and the plot it generates
Here is the list of sequences and the plot it generates
Hey Sarah, I replied to your email but perhaps it didn't go through.
The reason they are different is that ggseqlogo scales each position by the information content. This information is encoded in the matrix generated by ggseqlogo. If you provide a raw relative frequency matrix as you did, it does not have any knowledge of information content and thus does not scale the columns accordingly.
- See https://rdrr.io/cran/ggseqlogo/src/R/heights.r for how ggseqlogo generates the matrix (under makePFM)
- See https://en.wikipedia.org/wiki/Sequence_logo under logo creation for more info.
I would recommend you feed the sequences directly into ggseqlogo and let it handle this. The matrix functionality was created for cases where the input sequence was not available.
Hi! Thank you so much - I didn't receive your email... weird! This makes sense. Thanks again.