Indexing syntax error on pandas dataframe
webermarcolivier opened this issue · 3 comments
Describe the bug
Maybe I am missing something, but I am trying to create a simple logo example in Python by defining a PFM with a 23-letter amino acid alphabet, and I get a pandas indexing error in the _update_pm method. Since the internal type of the PM is a pandas DataFrame, shouldn't it be indexed with self._get_pm.iloc[:,:-1] instead of self._get_pm[:,:-1]?
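The difference between the two indexing forms can be checked in isolation. The snippet below is a minimal sketch that does not use seqlogo at all: the small DataFrame and its column names are made up for illustration. It only shows that plain bracket indexing with a tuple of slices fails on a DataFrame, while .iloc accepts it. The exact exception type depends on the pandas version, so the sketch catches a generic Exception.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-position matrix; the last column stands in for an
# "all-encompassing" letter such as N. Not taken from seqlogo.
df = pd.DataFrame(np.arange(1, 13, dtype=float).reshape(3, 4),
                  columns=list("ACGN"))

# NumPy-style indexing is treated as a column-label lookup and fails.
# The exception type varies between pandas versions.
failed = False
try:
    df[:, :-1]
except Exception as exc:
    failed = True
    print(type(exc).__name__, exc)

# Positional indexing goes through .iloc: every row, all but the last column.
weight = df.iloc[:, :-1].sum(axis=1) / df.sum(axis=1)
print(weight)
```

With .iloc, the expression from the traceback evaluates to one weight per position (row).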
To Reproduce
import seqlogo
import numpy as np
pfmArr = np.array([[0. , 0. , 0. , 0. , 0. ,
1. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. ],
[0.34285714, 0.01428571, 0. , 0. , 0. ,
0. , 0.01428571, 0. , 0.15714286, 0. ,
0.02857143, 0. , 0. , 0.02857143, 0.15714286,
0.2 , 0.04285714, 0.01428571, 0. , 0. ,
0. , 0. , 0. ],
[0.01428571, 0. , 0. , 0. , 0. ,
0. , 0.04285714, 0.05714286, 0.81428571, 0. ,
0. , 0. , 0. , 0. , 0.04285714,
0. , 0.01428571, 0.01428571, 0. , 0. ,
0. , 0. , 0. ],
[0. , 0. , 0.02857143, 0. , 0. ,
0.07142857, 0.02857143, 0. , 0.62857143, 0.01428571,
0. , 0.1 , 0. , 0. , 0.12857143,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. ],
[0.08571429, 0. , 0. , 0.01428571, 0. ,
0. , 0.02857143, 0. , 0.12857143, 0.04285714,
0. , 0.11428571, 0. , 0.05714286, 0.48571429,
0. , 0.01428571, 0. , 0. , 0.02857143,
0. , 0. , 0. ]])
print(pfmArr.shape)
seqlogo.Pfm(pm_filename_or_array=pfmArr, alphabet_type='reduced AA')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-754-fca28fc08af2> in <module>
28 0. , 0. , 0. ]])
29 print(pfmArr.shape)
---> 30 seqlogo.Pfm(pm_filename_or_array=pfmArr, alphabet_type='reduced AA')
~/.local/anaconda3/lib/python3.6/site-packages/seqlogo/core.py in __init__(self, *args, **kwargs)
450
451 def __init__(self, *args, **kwargs):
--> 452 super().__init__(*args, pm_type='pfm', **kwargs)
453
454 @classmethod
~/.local/anaconda3/lib/python3.6/site-packages/seqlogo/core.py in __init__(self, pm_filename_or_array, pm_type, alphabet_type, alphabet, background, pseudocount)
175
176 if pm_filename_or_array is not None:
--> 177 self._update_pm(pm_filename_or_array, pm_type, alphabet_type, alphabet, self.background, self.pseudocount)
178
179 def _update_pm(self, pm, pm_type ='ppm', alphabet_type = 'DNA', alphabet = None, background = None, pseudocount = None):
~/.local/anaconda3/lib/python3.6/site-packages/seqlogo/core.py in _update_pm(self, pm, pm_type, alphabet_type, alphabet, background, pseudocount)
186 raise ValueError('pseudocount must be the same length as sequence or a constant')
187 if self._alphabet_type not in ("DNA", "RNA", "AA"):
--> 188 self._weight = self._get_pm[:,:-1].sum(axis=1)/self._get_pm.sum(axis=1)
189 else:
190 self._weight = np.ones((self.width,), dtype=np.int8)
~/.local/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
2683 return self._getitem_multilevel(key)
2684 else:
-> 2685 return self._getitem_column(key)
2686
2687 def _getitem_column(self, key):
~/.local/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
2690 # get column
2691 if self.columns.is_unique:
-> 2692 return self._get_item_cache(key)
2693
2694 # duplicate columns & possible reduce dimensionality
~/.local/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
2482 """Return the cached item, item represents a label indexer."""
2483 cache = self._item_cache
-> 2484 res = cache.get(item)
2485 if res is None:
2486 values = self._data.get(item)
TypeError: unhashable type: 'slice'
Desktop:
- OS: Ubuntu 18.04
- Python 3.6.5
- NumPy 1.14.2
- Pandas 0.23.0
- matplotlib 3.0.2
For transparency's sake, my lab and students were only working with nucleotides, so the amino acid side of things didn't get a lot of testing. Thanks for bringing this to my attention! I will look into it later today.
Sorry for the delay. This has turned into a bigger issue with the underlying code than originally thought. Will work on it over the weekend.
Okay, I believe I fixed it. Please feel free to do some more beta-testing on this.
For reference (since the literature is kind of vague): I assume that the observed counts of any ambiguous letter (N, *, Y, V, etc.) can be divided among its respective degenerate letters, e.g. {"V": ("A", "C", "G")}. Furthermore, both the N and - letters (and their respective amino acid counterparts) are used to calculate the weight of each position. That means that if a position has lots of gaps or all-encompassing letters, that position will be essentially blank.
Hopefully this helps.
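One possible reading of the scheme described above, as a rough sketch rather than the actual seqlogo implementation: the function names, the DEGENERATE mapping, and the example counts are all hypothetical. Treating N and - as the only uninformative letters, and weighting each position by the fraction of informative counts, are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical subset of degenerate nucleotide codes, for illustration only.
DEGENERATE = {"V": ("A", "C", "G"), "Y": ("C", "T")}
# Assumption: these letters carry no information and only affect the weight.
UNINFORMATIVE = ("N", "-")

def split_ambiguous(counts):
    """Divide each ambiguous letter's counts equally among its target letters."""
    letters = [c for c in counts.columns
               if c not in DEGENERATE and c not in UNINFORMATIVE]
    core = counts[letters].astype(float)
    for letter, targets in DEGENERATE.items():
        if letter in counts.columns:
            share = counts[letter] / len(targets)
            for target in targets:
                core[target] = core[target] + share
    return core

def position_weights(counts):
    """Fraction of counts at each position that are not N or gap."""
    present = [c for c in UNINFORMATIVE if c in counts.columns]
    return 1.0 - counts[present].sum(axis=1) / counts.sum(axis=1)

# Two made-up positions: the second is dominated by N and gaps.
counts = pd.DataFrame({"A": [4, 0], "C": [0, 0], "G": [0, 0], "T": [0, 1],
                       "V": [3, 0], "N": [1, 4], "-": [2, 5]})
core = split_ambiguous(counts)
weights = position_weights(counts)
print(core)
print(weights)
```

In this sketch the second position gets a weight close to zero, which matches the description of a position full of gaps being rendered essentially blank.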