Matrix conversion and creation does not work when using amino acid sequences
weitzner opened this issue · 6 comments
Description of the bug
The library doesn't work with amino acid sequences. There are lots of internal parts that seem to omit the supplied alphabet_type
parameter, which causes it to default to DNA
. Then it throws an error because there are more than 4 letters.
To Reproduce
pfm = np.zeros((len(list_of_seqs), 20))
for seq in list_of_seqs:
for i, aa in enumerate(seq):
pfm[i, aa_alphabet.index(aa)] += 1
pfm = seqlogo.Pfm(pfm, alphabet_type="AA")
ppm = seqlogo.Ppm(seqlogo.pfm2ppm(pfm, alphabet_type="AA"), alphabet_type="AA")
This raises
ValueError: None alphabet selected, but PM is not 4 rows long
Expected behavior
This should create a Ppm
instance.
Additional context
I found a slight workaround by creating a Pfm
instead. This then fails when attempting to produce the weblogo itself:
pfm = np.zeros((len(list_of_seqs), 20))
for seq in list_of_seqs:
for i, aa in enumerate(seq):
pfm[i, aa_alphabet.index(aa)] += 1
pfm = seqlogo.Pfm(pfm, alphabet_type="AA")
ppm = seqlogo.Pfm(seqlogo.pfm2ppm(pfm, alphabet_type="AA"), alphabet_type = "AA")
seqlogo.seqlogo(ppm, ic_scale = False, size = 'medium')
However, this raises
TypeError: must be str, not bytes
which is surprising because this is a python 3.x only library.
I see your issue and thank you for bringing it to my attention. I will have to do a little "play testing" on my end to get reacquainted with the code base.
In an attempt to not minimize the issue you posted, if your work is a bit time sensitive and in need of a quick fix, please also consider @jbkinney's logomaker as its plotting capabilities are bit more robust and fleshed out compared to mine--with no dependency on weblogo as well.
Hopefully, though, I can resolve your issue quickly. Sorry for the inconvenience.
Sorry for the long delay in responding. Campus life got a little real and I finally have some bandwidth for this.
The problem doesn't actually arise in the creating of the seqlogo.Pfm
:
import random
import numpy as np
import pandas as pd
import seqlogo
def gen_seqs(n, length=25, alphabet_type='AA', seed = None):
"""Pseudo-random generation of simulated sequences
Args:
n (int): The desired number of sequences
length (int): The length (bp/aa) of the desired sequences (default: 25)
alphabet_type (str): Desired alphabet to sample from (default: 'AA')
Choose from:
"DNA" := "ACGT"
"reduced DNA" := "ACGTN-"
"ambig DNA" := "ACGTRYSWKMBDHVN-"
"RNA" := "ACGU"
"reduced RNA" := "ACGUN-"
"ambig RNA" := "ACGURYSWKMBDHVN-"
"AA" : = "ACDEFGHIKLMNPQRSTVWY"
"reduced AA" := "ACDEFGHIKLMNPQRSTVWYX*-"
"ambig AA" := "ACDEFGHIKLMNOPQRSTUVWYBJZX*-"
seed (int): random seed for testing (default: None)
Returns:
(list): n length list of length long sequences
"""
random.seed(seed)
return [''.join(random.choices(seqlogo.utils._IDX_LETTERS[alphabet_type], k=length)) for _ in range(n)]
# Make my simulated sequences
alphabet_type = 'AA'
list_of_seqs = gen_seqs(15, alphabet_type = alphabet_type)
# For duplicating aa_alphabet using supported alphabets
aa_alphabet = seqlogo.utils._IDX_LETTERS[alphabet_type]
# Make the sample PFM (MODIFIED FROM ISSUE)
pfm = np.zeros((len(list_of_seqs), 20))
for i, seq in enumerate(list_of_seqs):
for aa in seq:
pfm[i, aa_alphabet.index(aa)] += 1
# Convert into a seqlogo.Pfm
pfm = seqlogo.Pfm(pfm, alphabet_type="AA")
# Get some summary data
print(f"""{pfm}
Pfm alphabet_type: {pfm.alphabet_type}
Pfm alphabet: {pfm.alphabet}
Number of Rows: {pfm.shape[0]}
Number of Columns : {pfm.shape[1]}
""")
A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 5.0 | 2.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 |
1 | 2.0 | 2.0 | 0.0 | 1.0 | 5.0 | 1.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 0.0 | 3.0 | 1.0 | 0.0 |
2 | 3.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 2.0 | 1.0 | 0.0 | 5.0 | 0.0 | 2.0 | 0.0 | 1.0 | 1.0 | 2.0 | 1.0 | 3.0 | 0.0 |
3 | 2.0 | 0.0 | 4.0 | 1.0 | 1.0 | 3.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 1.0 | 0.0 | 0.0 | 3.0 | 0.0 |
4 | 1.0 | 2.0 | 1.0 | 3.0 | 4.0 | 0.0 | 2.0 | 0.0 | 1.0 | 3.0 | 0.0 | 2.0 | 1.0 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
5 | 2.0 | 3.0 | 1.0 | 3.0 | 0.0 | 1.0 | 0.0 | 2.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 |
6 | 0.0 | 3.0 | 0.0 | 1.0 | 3.0 | 3.0 | 3.0 | 4.0 | 3.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
7 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 2.0 | 1.0 | 3.0 | 0.0 | 0.0 | 2.0 | 1.0 | 0.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 |
8 | 1.0 | 1.0 | 0.0 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 0.0 | 1.0 | 2.0 | 1.0 | 3.0 | 2.0 | 0.0 | 1.0 | 2.0 | 1.0 |
9 | 1.0 | 2.0 | 2.0 | 1.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 |
10 | 0.0 | 3.0 | 2.0 | 0.0 | 5.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 |
11 | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 0.0 | 1.0 | 3.0 | 2.0 | 1.0 | 2.0 | 2.0 | 0.0 | 0.0 | 3.0 | 1.0 | 0.0 | 0.0 |
12 | 2.0 | 1.0 | 0.0 | 0.0 | 4.0 | 2.0 | 1.0 | 0.0 | 0.0 | 2.0 | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 2.0 | 0.0 | 1.0 | 1.0 | 1.0 |
13 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 1.0 | 1.0 | 6.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 2.0 | 0.0 | 2.0 | 1.0 |
14 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 1.0 | 0.0 | 4.0 | 1.0 | 1.0 | 4.0 | 0.0 | 1.0 | 2.0 | 1.0 |
Pfm alphabet_type: AA
Pfm alphabet: ACDEFGHIKLMNPQRSTVWY
Number of Rows: 15
Number of Columns : 20
Converting the PFM to a PPM doesn't seem to be the issue either:
seqlogo.pfm2ppm(pfm, alphabet_type='AA')
A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00 | 0.08 | 0.00 | 0.08 | 0.12 | 0.04 | 0.04 | 0.00 | 0.00 | 0.12 | 0.08 | 0.04 | 0.04 | 0.00 | 0.04 | 0.12 | 0.04 | 0.00 | 0.08 | 0.08 |
1 | 0.12 | 0.16 | 0.00 | 0.12 | 0.00 | 0.04 | 0.08 | 0.00 | 0.04 | 0.08 | 0.08 | 0.04 | 0.00 | 0.04 | 0.08 | 0.04 | 0.04 | 0.00 | 0.04 | 0.00 |
2 | 0.00 | 0.00 | 0.08 | 0.08 | 0.04 | 0.00 | 0.08 | 0.00 | 0.12 | 0.04 | 0.04 | 0.04 | 0.04 | 0.08 | 0.08 | 0.08 | 0.00 | 0.08 | 0.04 | 0.08 |
3 | 0.20 | 0.04 | 0.00 | 0.04 | 0.04 | 0.00 | 0.00 | 0.08 | 0.12 | 0.04 | 0.00 | 0.12 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.00 | 0.08 |
4 | 0.00 | 0.12 | 0.04 | 0.08 | 0.00 | 0.12 | 0.08 | 0.00 | 0.00 | 0.04 | 0.08 | 0.08 | 0.00 | 0.08 | 0.04 | 0.00 | 0.08 | 0.04 | 0.04 | 0.08 |
5 | 0.08 | 0.04 | 0.08 | 0.00 | 0.04 | 0.08 | 0.08 | 0.00 | 0.00 | 0.08 | 0.04 | 0.04 | 0.12 | 0.04 | 0.00 | 0.08 | 0.08 | 0.08 | 0.04 | 0.00 |
6 | 0.16 | 0.04 | 0.04 | 0.04 | 0.04 | 0.08 | 0.04 | 0.04 | 0.04 | 0.00 | 0.04 | 0.04 | 0.04 | 0.08 | 0.04 | 0.08 | 0.08 | 0.00 | 0.08 | 0.00 |
7 | 0.00 | 0.00 | 0.04 | 0.00 | 0.04 | 0.00 | 0.12 | 0.00 | 0.04 | 0.20 | 0.00 | 0.00 | 0.12 | 0.00 | 0.12 | 0.00 | 0.12 | 0.08 | 0.00 | 0.12 |
8 | 0.08 | 0.00 | 0.04 | 0.04 | 0.04 | 0.08 | 0.00 | 0.04 | 0.00 | 0.00 | 0.08 | 0.00 | 0.20 | 0.00 | 0.12 | 0.04 | 0.04 | 0.08 | 0.00 | 0.12 |
9 | 0.12 | 0.08 | 0.12 | 0.04 | 0.00 | 0.04 | 0.00 | 0.12 | 0.08 | 0.00 | 0.00 | 0.12 | 0.08 | 0.00 | 0.00 | 0.16 | 0.04 | 0.00 | 0.00 | 0.00 |
10 | 0.00 | 0.00 | 0.04 | 0.04 | 0.04 | 0.00 | 0.12 | 0.04 | 0.08 | 0.04 | 0.04 | 0.04 | 0.12 | 0.16 | 0.00 | 0.00 | 0.12 | 0.04 | 0.04 | 0.04 |
11 | 0.08 | 0.04 | 0.08 | 0.04 | 0.08 | 0.04 | 0.00 | 0.04 | 0.08 | 0.08 | 0.00 | 0.00 | 0.04 | 0.00 | 0.04 | 0.00 | 0.12 | 0.12 | 0.08 | 0.04 |
12 | 0.08 | 0.00 | 0.08 | 0.04 | 0.04 | 0.04 | 0.08 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.08 | 0.20 | 0.08 | 0.04 | 0.04 | 0.12 | 0.00 | 0.04 |
13 | 0.08 | 0.00 | 0.00 | 0.08 | 0.08 | 0.04 | 0.04 | 0.24 | 0.04 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.16 | 0.00 | 0.08 | 0.08 | 0.04 |
14 | 0.04 | 0.00 | 0.12 | 0.12 | 0.12 | 0.08 | 0.00 | 0.04 | 0.04 | 0.08 | 0.08 | 0.00 | 0.08 | 0.04 | 0.00 | 0.04 | 0.00 | 0.00 | 0.08 | 0.04 |
The Problem
The issue arises when we seqlogo
uses the pfm2ppm
function...which you highlighted on. This is in part due to some inheritance issues. In my development build, that is fixed.
The second issue (one that you brought up) is that certain arguments from the conversion functions (e.g. pfm2ppm
) don't carry over to calls to Ppm
.
I have fixed this as well, and should have an update soon.
With these changes, the following works:
ppm = seqlogo.Ppm(seqlogo.pfm2ppm(pfm, alphabet_type='AA'), alphabet_type = 'AA')
ppm
A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.04 | 0.12 | 0.04 | 0.08 | 0.04 | 0.04 | 0.04 | 0.00 | 0.08 | 0.00 | 0.00 | 0.08 | 0.04 | 0.00 | 0.08 | 0.00 | 0.16 | 0.12 | 0.00 | 0.04 |
1 | 0.12 | 0.04 | 0.08 | 0.08 | 0.00 | 0.04 | 0.04 | 0.12 | 0.04 | 0.00 | 0.04 | 0.12 | 0.08 | 0.00 | 0.00 | 0.04 | 0.04 | 0.08 | 0.00 | 0.04 |
2 | 0.08 | 0.04 | 0.04 | 0.04 | 0.00 | 0.08 | 0.04 | 0.12 | 0.20 | 0.00 | 0.04 | 0.00 | 0.04 | 0.00 | 0.08 | 0.04 | 0.04 | 0.04 | 0.08 | 0.00 |
3 | 0.12 | 0.04 | 0.00 | 0.08 | 0.00 | 0.08 | 0.04 | 0.04 | 0.04 | 0.12 | 0.08 | 0.04 | 0.08 | 0.00 | 0.04 | 0.04 | 0.04 | 0.00 | 0.04 | 0.08 |
4 | 0.00 | 0.16 | 0.08 | 0.08 | 0.08 | 0.00 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.08 | 0.08 | 0.00 | 0.12 | 0.04 | 0.00 | 0.00 | 0.04 |
5 | 0.04 | 0.08 | 0.08 | 0.00 | 0.08 | 0.04 | 0.08 | 0.04 | 0.04 | 0.08 | 0.00 | 0.00 | 0.00 | 0.04 | 0.16 | 0.08 | 0.04 | 0.00 | 0.04 | 0.08 |
6 | 0.08 | 0.04 | 0.00 | 0.12 | 0.04 | 0.08 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.00 | 0.16 | 0.04 | 0.04 | 0.08 | 0.04 | 0.00 |
7 | 0.04 | 0.00 | 0.04 | 0.08 | 0.08 | 0.04 | 0.00 | 0.00 | 0.00 | 0.16 | 0.00 | 0.00 | 0.12 | 0.12 | 0.04 | 0.00 | 0.12 | 0.04 | 0.08 | 0.04 |
8 | 0.12 | 0.08 | 0.04 | 0.04 | 0.08 | 0.16 | 0.04 | 0.04 | 0.00 | 0.04 | 0.04 | 0.08 | 0.04 | 0.00 | 0.00 | 0.04 | 0.04 | 0.08 | 0.00 | 0.04 |
9 | 0.04 | 0.12 | 0.08 | 0.04 | 0.12 | 0.04 | 0.04 | 0.04 | 0.12 | 0.00 | 0.08 | 0.00 | 0.00 | 0.12 | 0.04 | 0.04 | 0.04 | 0.04 | 0.00 | 0.00 |
10 | 0.08 | 0.04 | 0.08 | 0.04 | 0.00 | 0.04 | 0.04 | 0.00 | 0.00 | 0.04 | 0.04 | 0.00 | 0.12 | 0.08 | 0.04 | 0.08 | 0.20 | 0.04 | 0.04 | 0.00 |
11 | 0.08 | 0.12 | 0.12 | 0.04 | 0.00 | 0.08 | 0.00 | 0.08 | 0.04 | 0.04 | 0.04 | 0.08 | 0.00 | 0.00 | 0.12 | 0.04 | 0.08 | 0.00 | 0.04 | 0.00 |
12 | 0.04 | 0.04 | 0.04 | 0.08 | 0.00 | 0.16 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 0.04 | 0.04 | 0.04 | 0.00 | 0.08 | 0.00 | 0.00 | 0.24 | 0.16 |
13 | 0.00 | 0.00 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.12 | 0.04 | 0.08 | 0.00 | 0.12 | 0.12 | 0.00 | 0.08 | 0.00 | 0.04 | 0.08 | 0.08 | 0.04 |
14 | 0.12 | 0.08 | 0.04 | 0.08 | 0.04 | 0.04 | 0.04 | 0.00 | 0.12 | 0.08 | 0.00 | 0.00 | 0.04 | 0.00 | 0.12 | 0.08 | 0.00 | 0.00 | 0.08 | 0.04 |
I should have the updated version on PyPI soon, and bioconda should follow
I'm currently experiencing a similar problem with the alphabet "ambig DNA". When creating an ambig DNA Pfm, it looks like it's immediately converted into a DNA Pfm, while my goal is to display an ambiguous DNA sequence.
When writing
A = seqlogo.Pfm(np.random.randint(0,36,size=(8,16)),alphabet_type="ambig DNA")
an object with alphabet_type "DNA" is returned
While I am looking into this @eloilit, can you actually create a new issue for this please?