betteridiot/seqlogo

Matrix conversion and creation does not work when using amino acid sequences

weitzner opened this issue · 6 comments

Description of the bug
The library doesn't work with amino acid sequences. There are lots of internal parts that seem to omit the supplied alphabet_type parameter, which causes it to default to DNA. Then it throws an error because there are more than 4 letters.

To Reproduce

pfm = np.zeros((len(list_of_seqs), 20))
for seq in list_of_seqs:
    for i, aa in enumerate(seq):
        pfm[i, aa_alphabet.index(aa)] += 1
pfm = seqlogo.Pfm(pfm, alphabet_type="AA")
ppm = seqlogo.Ppm(seqlogo.pfm2ppm(pfm, alphabet_type="AA"), alphabet_type="AA")

This raises

ValueError: None alphabet selected, but PM is not 4 rows long

Expected behavior
This should create a Ppm instance.

Additional context
I found a slight workaround by creating a Pfm instead. This then fails when attempting to produce the weblogo itself:

pfm = np.zeros((len(list_of_seqs), 20))
for seq in list_of_seqs:
    for i, aa in enumerate(seq):
        pfm[i, aa_alphabet.index(aa)] += 1
pfm = seqlogo.Pfm(pfm, alphabet_type="AA")
ppm = seqlogo.Pfm(seqlogo.pfm2ppm(pfm, alphabet_type="AA"), alphabet_type = "AA")

seqlogo.seqlogo(ppm, ic_scale = False, size = 'medium')

However, this raises

TypeError: must be str, not bytes

which is surprising because this is a python 3.x only library.

I see your issue and thank you for bringing it to my attention. I will have to do a little "play testing" on my end to get reacquainted with the code base.

In an attempt to not minimize the issue you posted, if your work is a bit time sensitive and in need of a quick fix, please also consider @jbkinney's logomaker as its plotting capabilities are bit more robust and fleshed out compared to mine--with no dependency on weblogo as well.

Hopefully, though, I can resolve your issue quickly. Sorry for the inconvenience.

Sorry for the long delay in responding. Campus life got a little real and I finally have some bandwidth for this.

The problem doesn't actually arise in the creating of the seqlogo.Pfm:

import random
import numpy as np
import pandas as pd

import seqlogo


def gen_seqs(n, length=25, alphabet_type='AA', seed = None):
    """Pseudo-random generation of simulated sequences
    
    Args:
        n (int): The desired number of sequences
        length (int): The length (bp/aa) of the desired sequences (default: 25)
        alphabet_type (str): Desired alphabet to sample from (default: 'AA')
            Choose from:
                "DNA" := "ACGT"
                "reduced DNA" := "ACGTN-"
                "ambig DNA" := "ACGTRYSWKMBDHVN-"
                "RNA" := "ACGU"
                "reduced RNA" := "ACGUN-"
                "ambig RNA" := "ACGURYSWKMBDHVN-"
                "AA" : = "ACDEFGHIKLMNPQRSTVWY"
                "reduced AA" := "ACDEFGHIKLMNPQRSTVWYX*-"
                "ambig AA" := "ACDEFGHIKLMNOPQRSTUVWYBJZX*-"
        seed (int): random seed for testing (default: None)
    
    Returns:
        (list): n length list of length long sequences 
    """
    random.seed(seed)
    return [''.join(random.choices(seqlogo.utils._IDX_LETTERS[alphabet_type], k=length)) for _ in range(n)]


# Make my simulated sequences
alphabet_type = 'AA'
list_of_seqs = gen_seqs(15, alphabet_type = alphabet_type)

# For duplicating aa_alphabet using supported alphabets
aa_alphabet = seqlogo.utils._IDX_LETTERS[alphabet_type]

# Make the sample PFM (MODIFIED FROM ISSUE)
pfm = np.zeros((len(list_of_seqs), 20))
for i, seq in enumerate(list_of_seqs): 
    for aa in seq:
        pfm[i, aa_alphabet.index(aa)] += 1

# Convert into a seqlogo.Pfm
pfm = seqlogo.Pfm(pfm, alphabet_type="AA")

# Get some summary data
print(f"""{pfm}
Pfm alphabet_type:  {pfm.alphabet_type}
Pfm alphabet:       {pfm.alphabet}
Number of Rows:     {pfm.shape[0]}
Number of Columns : {pfm.shape[1]}
""")
A C D E F G H I K L M N P Q R S T V W Y
0 1.0 1.0 0.0 1.0 1.0 5.0 2.0 1.0 1.0 0.0 1.0 1.0 0.0 1.0 0.0 3.0 3.0 3.0 0.0 0.0
1 2.0 2.0 0.0 1.0 5.0 1.0 2.0 0.0 0.0 1.0 0.0 1.0 1.0 2.0 1.0 2.0 0.0 3.0 1.0 0.0
2 3.0 1.0 0.0 1.0 1.0 0.0 1.0 2.0 1.0 0.0 5.0 0.0 2.0 0.0 1.0 1.0 2.0 1.0 3.0 0.0
3 2.0 0.0 4.0 1.0 1.0 3.0 1.0 1.0 0.0 0.0 1.0 1.0 2.0 2.0 2.0 1.0 0.0 0.0 3.0 0.0
4 1.0 2.0 1.0 3.0 4.0 0.0 2.0 0.0 1.0 3.0 0.0 2.0 1.0 2.0 1.0 0.0 1.0 0.0 1.0 0.0
5 2.0 3.0 1.0 3.0 0.0 1.0 0.0 2.0 2.0 1.0 1.0 2.0 1.0 2.0 1.0 1.0 2.0 0.0 0.0 0.0
6 0.0 3.0 0.0 1.0 3.0 3.0 3.0 4.0 3.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0
7 2.0 0.0 0.0 1.0 0.0 3.0 2.0 1.0 3.0 0.0 0.0 2.0 1.0 0.0 2.0 1.0 1.0 1.0 1.0 4.0
8 1.0 1.0 0.0 2.0 2.0 1.0 1.0 1.0 2.0 1.0 0.0 1.0 2.0 1.0 3.0 2.0 0.0 1.0 2.0 1.0
9 1.0 2.0 2.0 1.0 0.0 3.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 2.0 4.0
10 0.0 3.0 2.0 0.0 5.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 2.0 2.0 0.0 1.0 1.0 2.0 2.0 2.0
11 3.0 1.0 1.0 1.0 1.0 0.0 3.0 0.0 1.0 3.0 2.0 1.0 2.0 2.0 0.0 0.0 3.0 1.0 0.0 0.0
12 2.0 1.0 0.0 0.0 4.0 2.0 1.0 0.0 0.0 2.0 3.0 3.0 1.0 0.0 1.0 2.0 0.0 1.0 1.0 1.0
13 0.0 1.0 0.0 1.0 0.0 3.0 1.0 1.0 6.0 1.0 2.0 1.0 1.0 1.0 1.0 0.0 2.0 0.0 2.0 1.0
14 2.0 1.0 2.0 1.0 1.0 0.0 0.0 1.0 2.0 0.0 1.0 0.0 4.0 1.0 1.0 4.0 0.0 1.0 2.0 1.0
Pfm alphabet_type:  AA
Pfm alphabet:       ACDEFGHIKLMNPQRSTVWY
Number of Rows:     15
Number of Columns : 20

Converting the PFM to a PPM doesn't seem to be the issue either:

seqlogo.pfm2ppm(pfm, alphabet_type='AA')
A C D E F G H I K L M N P Q R S T V W Y
0 0.00 0.08 0.00 0.08 0.12 0.04 0.04 0.00 0.00 0.12 0.08 0.04 0.04 0.00 0.04 0.12 0.04 0.00 0.08 0.08
1 0.12 0.16 0.00 0.12 0.00 0.04 0.08 0.00 0.04 0.08 0.08 0.04 0.00 0.04 0.08 0.04 0.04 0.00 0.04 0.00
2 0.00 0.00 0.08 0.08 0.04 0.00 0.08 0.00 0.12 0.04 0.04 0.04 0.04 0.08 0.08 0.08 0.00 0.08 0.04 0.08
3 0.20 0.04 0.00 0.04 0.04 0.00 0.00 0.08 0.12 0.04 0.00 0.12 0.04 0.04 0.04 0.04 0.04 0.04 0.00 0.08
4 0.00 0.12 0.04 0.08 0.00 0.12 0.08 0.00 0.00 0.04 0.08 0.08 0.00 0.08 0.04 0.00 0.08 0.04 0.04 0.08
5 0.08 0.04 0.08 0.00 0.04 0.08 0.08 0.00 0.00 0.08 0.04 0.04 0.12 0.04 0.00 0.08 0.08 0.08 0.04 0.00
6 0.16 0.04 0.04 0.04 0.04 0.08 0.04 0.04 0.04 0.00 0.04 0.04 0.04 0.08 0.04 0.08 0.08 0.00 0.08 0.00
7 0.00 0.00 0.04 0.00 0.04 0.00 0.12 0.00 0.04 0.20 0.00 0.00 0.12 0.00 0.12 0.00 0.12 0.08 0.00 0.12
8 0.08 0.00 0.04 0.04 0.04 0.08 0.00 0.04 0.00 0.00 0.08 0.00 0.20 0.00 0.12 0.04 0.04 0.08 0.00 0.12
9 0.12 0.08 0.12 0.04 0.00 0.04 0.00 0.12 0.08 0.00 0.00 0.12 0.08 0.00 0.00 0.16 0.04 0.00 0.00 0.00
10 0.00 0.00 0.04 0.04 0.04 0.00 0.12 0.04 0.08 0.04 0.04 0.04 0.12 0.16 0.00 0.00 0.12 0.04 0.04 0.04
11 0.08 0.04 0.08 0.04 0.08 0.04 0.00 0.04 0.08 0.08 0.00 0.00 0.04 0.00 0.04 0.00 0.12 0.12 0.08 0.04
12 0.08 0.00 0.08 0.04 0.04 0.04 0.08 0.00 0.00 0.04 0.00 0.00 0.08 0.20 0.08 0.04 0.04 0.12 0.00 0.04
13 0.08 0.00 0.00 0.08 0.08 0.04 0.04 0.24 0.04 0.04 0.00 0.00 0.00 0.00 0.00 0.16 0.00 0.08 0.08 0.04
14 0.04 0.00 0.12 0.12 0.12 0.08 0.00 0.04 0.04 0.08 0.08 0.00 0.08 0.04 0.00 0.04 0.00 0.00 0.08 0.04

The Problem

The issue arises when we seqlogo uses the pfm2ppm function...which you highlighted on. This is in part due to some inheritance issues. In my development build, that is fixed.

The second issue (one that you brought up) is that certain arguments from the conversion functions (e.g. pfm2ppm) don't carry over to calls to Ppm.

I have fixed this as well, and should have an update soon.

With these changes, the following works:

ppm = seqlogo.Ppm(seqlogo.pfm2ppm(pfm, alphabet_type='AA'), alphabet_type = 'AA')
ppm
A C D E F G H I K L M N P Q R S T V W Y
0 0.04 0.12 0.04 0.08 0.04 0.04 0.04 0.00 0.08 0.00 0.00 0.08 0.04 0.00 0.08 0.00 0.16 0.12 0.00 0.04
1 0.12 0.04 0.08 0.08 0.00 0.04 0.04 0.12 0.04 0.00 0.04 0.12 0.08 0.00 0.00 0.04 0.04 0.08 0.00 0.04
2 0.08 0.04 0.04 0.04 0.00 0.08 0.04 0.12 0.20 0.00 0.04 0.00 0.04 0.00 0.08 0.04 0.04 0.04 0.08 0.00
3 0.12 0.04 0.00 0.08 0.00 0.08 0.04 0.04 0.04 0.12 0.08 0.04 0.08 0.00 0.04 0.04 0.04 0.00 0.04 0.08
4 0.00 0.16 0.08 0.08 0.08 0.00 0.20 0.00 0.00 0.00 0.00 0.04 0.08 0.08 0.00 0.12 0.04 0.00 0.00 0.04
5 0.04 0.08 0.08 0.00 0.08 0.04 0.08 0.04 0.04 0.08 0.00 0.00 0.00 0.04 0.16 0.08 0.04 0.00 0.04 0.08
6 0.08 0.04 0.00 0.12 0.04 0.08 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.00 0.16 0.04 0.04 0.08 0.04 0.00
7 0.04 0.00 0.04 0.08 0.08 0.04 0.00 0.00 0.00 0.16 0.00 0.00 0.12 0.12 0.04 0.00 0.12 0.04 0.08 0.04
8 0.12 0.08 0.04 0.04 0.08 0.16 0.04 0.04 0.00 0.04 0.04 0.08 0.04 0.00 0.00 0.04 0.04 0.08 0.00 0.04
9 0.04 0.12 0.08 0.04 0.12 0.04 0.04 0.04 0.12 0.00 0.08 0.00 0.00 0.12 0.04 0.04 0.04 0.04 0.00 0.00
10 0.08 0.04 0.08 0.04 0.00 0.04 0.04 0.00 0.00 0.04 0.04 0.00 0.12 0.08 0.04 0.08 0.20 0.04 0.04 0.00
11 0.08 0.12 0.12 0.04 0.00 0.08 0.00 0.08 0.04 0.04 0.04 0.08 0.00 0.00 0.12 0.04 0.08 0.00 0.04 0.00
12 0.04 0.04 0.04 0.08 0.00 0.16 0.00 0.00 0.00 0.04 0.00 0.04 0.04 0.04 0.00 0.08 0.00 0.00 0.24 0.16
13 0.00 0.00 0.04 0.04 0.04 0.04 0.04 0.12 0.04 0.08 0.00 0.12 0.12 0.00 0.08 0.00 0.04 0.08 0.08 0.04
14 0.12 0.08 0.04 0.08 0.04 0.04 0.04 0.00 0.12 0.08 0.00 0.00 0.04 0.00 0.12 0.08 0.00 0.00 0.08 0.04

I should have the updated version on PyPI soon, and bioconda should follow

I'm currently experiencing a similar problem with the alphabet "ambig DNA". When creating an ambig DNA Pfm, it looks like it's immediately converted into a DNA Pfm, while my goal is to display an ambiguous DNA sequence.

When writing
A = seqlogo.Pfm(np.random.randint(0,36,size=(8,16)),alphabet_type="ambig DNA")

an object with alphabet_type "DNA" is returned

While I am looking into this @eloilit, can you actually create a new issue for this please?