betteridiot/seqlogo

Indexing syntax error on pandas dataframe

webermarcolivier opened this issue · 3 comments

Describe the bug

Maybe I am missing something, but I am trying to create a simple logo example in Python by defining a PFM with a 23-letter amino acid alphabet, and I get a pandas indexing error in the _update_pm method. Since the PM is stored internally as a pandas DataFrame, shouldn't it be indexed with self._get_pm.iloc[:,:-1] instead of self._get_pm[:,:-1]?

To Reproduce

import seqlogo
import numpy as np

pfmArr = np.array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        1.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        ],
       [0.34285714, 0.01428571, 0.        , 0.        , 0.        ,
        0.        , 0.01428571, 0.        , 0.15714286, 0.        ,
        0.02857143, 0.        , 0.        , 0.02857143, 0.15714286,
        0.2       , 0.04285714, 0.01428571, 0.        , 0.        ,
        0.        , 0.        , 0.        ],
       [0.01428571, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.04285714, 0.05714286, 0.81428571, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.04285714,
        0.        , 0.01428571, 0.01428571, 0.        , 0.        ,
        0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.02857143, 0.        , 0.        ,
        0.07142857, 0.02857143, 0.        , 0.62857143, 0.01428571,
        0.        , 0.1       , 0.        , 0.        , 0.12857143,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        ],
       [0.08571429, 0.        , 0.        , 0.01428571, 0.        ,
        0.        , 0.02857143, 0.        , 0.12857143, 0.04285714,
        0.        , 0.11428571, 0.        , 0.05714286, 0.48571429,
        0.        , 0.01428571, 0.        , 0.        , 0.02857143,
        0.        , 0.        , 0.        ]])
print(pfmArr.shape)
seqlogo.Pfm(pm_filename_or_array=pfmArr, alphabet_type='reduced AA')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-754-fca28fc08af2> in <module>
     28         0.        , 0.        , 0.        ]])
     29 print(pfmArr.shape)
---> 30 seqlogo.Pfm(pm_filename_or_array=pfmArr, alphabet_type='reduced AA')

~/.local/anaconda3/lib/python3.6/site-packages/seqlogo/core.py in __init__(self, *args, **kwargs)
    450 
    451     def __init__(self, *args, **kwargs):
--> 452         super().__init__(*args, pm_type='pfm', **kwargs)
    453 
    454     @classmethod

~/.local/anaconda3/lib/python3.6/site-packages/seqlogo/core.py in __init__(self, pm_filename_or_array, pm_type, alphabet_type, alphabet, background, pseudocount)
    175 
    176         if pm_filename_or_array is not None:
--> 177             self._update_pm(pm_filename_or_array, pm_type, alphabet_type, alphabet, self.background, self.pseudocount)
    178 
    179     def _update_pm(self, pm, pm_type ='ppm', alphabet_type = 'DNA', alphabet = None, background = None, pseudocount = None):

~/.local/anaconda3/lib/python3.6/site-packages/seqlogo/core.py in _update_pm(self, pm, pm_type, alphabet_type, alphabet, background, pseudocount)
    186                 raise ValueError('pseudocount must be the same length as sequence or a constant')
    187         if self._alphabet_type not in ("DNA", "RNA", "AA"):
--> 188             self._weight = self._get_pm[:,:-1].sum(axis=1)/self._get_pm.sum(axis=1)
    189         else:
    190             self._weight = np.ones((self.width,), dtype=np.int8)

~/.local/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2683             return self._getitem_multilevel(key)
   2684         else:
-> 2685             return self._getitem_column(key)
   2686 
   2687     def _getitem_column(self, key):

~/.local/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   2690         # get column
   2691         if self.columns.is_unique:
-> 2692             return self._get_item_cache(key)
   2693 
   2694         # duplicate columns & possible reduce dimensionality

~/.local/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   2482         """Return the cached item, item represents a label indexer."""
   2483         cache = self._item_cache
-> 2484         res = cache.get(item)
   2485         if res is None:
   2486             values = self._data.get(item)

TypeError: unhashable type: 'slice'
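
The failure in the traceback can be reproduced without seqlogo. The sketch below is only an illustration: the column names and counts are made up, and the last column stands in for whatever column _update_pm excludes. Plain `df[:, :-1]` is treated as a column-label lookup and fails (the exact exception type depends on the pandas and Python versions), while `.iloc` does the positional slicing that line 188 of core.py appears to intend.

```python
import numpy as np
import pandas as pd

# Made-up count matrix: rows are positions, columns are letters.
# The last column ("N") plays the role of the column to exclude.
df = pd.DataFrame(
    np.array([[2.0, 2.0, 2.0, 4.0],
              [5.0, 5.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 10.0]]),
    columns=list("ACGN"),
)

# NumPy-style 2D slicing on a DataFrame is interpreted as a column key
# and raises (TypeError on the versions in this report).
try:
    df[:, :-1]
    plain_indexing_failed = False
except Exception as exc:
    plain_indexing_failed = True
    print(type(exc).__name__, exc)

# Positional indexing with .iloc gives the intended per-position ratio.
weight = df.iloc[:, :-1].sum(axis=1) / df.sum(axis=1)
print(weight.tolist())
```

If the matrix is only needed as an array at that point, `df.values[:, :-1]` would also work, at the cost of dropping the row and column labels.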

Desktop:

  • OS: Ubuntu 18.04
  • Python 3.6.5
  • NumPy 1.14.2
  • Pandas 0.23.0
  • matplotlib 3.0.2

For transparency's sake, my lab and students were only working with nucleotides, so the amino acid side of things didn't get much testing. Thanks for bringing this to my attention! I will look into it later today.

Sorry for the delay. This has turned into a bigger issue with the underlying code than originally thought. Will work on it over the weekend.

Okay, I believe I fixed it. Please feel free to do some more beta testing on this.

For reference (since the literature is kind of vague): I assume that the observed counts for any ambiguous letter (N, *, Y, V, etc.) can be divided among its respective degenerate letters, e.g. {"V": ("A", "C", "G")}. Furthermore, both the N and - letters (and their amino acid counterparts) are used to calculate the weight of each position. That means that a position with lots of gaps or all-encompassing letters will be essentially blank.
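
As a rough illustration of the assumption described above, here is a minimal sketch. It is not seqlogo's actual implementation: the function names `split_ambiguous` and `position_weight` are hypothetical, the equal split among degenerate letters is an assumption, and the mapping covers only a few IUPAC nucleotide codes.

```python
# Hypothetical helpers for illustration only; not part of seqlogo's API.
# A few IUPAC nucleotide ambiguity codes and the letters they stand for.
DEGENERATE = {"V": ("A", "C", "G"), "Y": ("C", "T"), "R": ("A", "G")}


def split_ambiguous(counts):
    """Divide each ambiguous letter's count equally among its letters.

    The equal split is an assumption made for this sketch.
    """
    resolved = {
        letter: float(n)
        for letter, n in counts.items()
        if letter not in DEGENERATE
    }
    for letter, targets in DEGENERATE.items():
        share = counts.get(letter, 0) / len(targets)
        for target in targets:
            resolved[target] = resolved.get(target, 0.0) + share
    return resolved


def position_weight(counts, uninformative=("N", "-")):
    """Fraction of observations at a position that are not N or gap."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    blank = sum(counts.get(letter, 0) for letter in uninformative)
    return (total - blank) / total


# One made-up alignment column: 3 observations of V are shared by A, C, G.
column = {"A": 4, "C": 1, "V": 3, "N": 2}
resolved = split_ambiguous(column)
weight = position_weight(column)
print(resolved)
print(weight)
```

With this scheme, a column made up entirely of N or - gets a weight of 0, which matches the "essentially blank" behaviour described above.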

Hopefully this helps.