markovmodel/PyEMMA

How to proceed with disconnected MSM

Closed this issue · 1 comments

Hi everyone,

I'm following the BPTI tutorial with a system I'm studying (a ligand binding to a protein). I'm just testing around and learning how to use your software, so for the moment I'm just proceeding forward with the MSM building workflow without doing much hyper parameter tuning. My dataset in total is 900 ns (45000 frames) for a 3800 atom system (after getting rid of solvent).

So, I am using a distance featurizer between the ligand and the alpha carbons of the protein, a tICA reduction to 5 dimensions and a clustering of a 1000 microstates with kmeans. I decided to build an MSM with a lag time of 8 ns (400 steps) after checking the timescale plots:
its

The problem I'm facing is that the MSM is not fully connected, and is only using 944 out of the 1000 clusters. When I try to reproduce the figure for how the 2nd slowest right eigenvector looks like, I get the following error:

cl = pyemma.coordinates.cluster_kmeans(data=Y, k=1000, stride=1)
M = msm.estimate_markov_model(cl.dtrajs, lag=400)

cc_x = cl.clustercenters[:,0]
cc_y = cl.clustercenters[:,1]
r2 = M.eigenvectors_right()[:,1]
r2 = M.eigenvectors_right()[:,1]
ax = mplt.scatter_contour(cc_x, cc_y, r2)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-78-8f489d57713b> in <module>()
      1 r2 = M.eigenvectors_right()[:,1]
      2 r2 = M.eigenvectors_right()[:,1]
----> 3 ax = mplt.scatter_contour(cc_x, cc_y, r2)

/Users/je714/anaconda3/lib/python3.5/site-packages/pyEMMA-2.2.3-py3.5-macosx-10.6-x86_64.egg/pyemma/plots/plots2d.py in scatter_contour(x, y, z, ncontours, colorbar, fig, ax, cmap, outfile)
     82     """
     83     import matplotlib.pylab as _plt
---> 84     ax = contour(x, y, z, ncontours=ncontours, colorbar=colorbar, fig=fig, ax=ax, cmap=cmap)
     85 
     86     # scatter points

/Users/je714/anaconda3/lib/python3.5/site-packages/pyEMMA-2.2.3-py3.5-macosx-10.6-x86_64.egg/pyemma/plots/plots2d.py in contour(x, y, z, ncontours, colorbar, fig, ax, method, zlim, cmap)
     37     points = _np.hstack([x[:,None],y[:,None]])
     38     xi, yi = _np.mgrid[x.min():x.max():100j, y.min():y.max():100j]
---> 39     zi = gd(points, z, (xi, yi), method=method)
     40     # contour level levels
     41     if zlim is None:

/Users/je714/anaconda3/lib/python3.5/site-packages/scipy/interpolate/ndgriddata.py in griddata(points, values, xi, method, fill_value, rescale)
    215     elif method == 'linear':
    216         ip = LinearNDInterpolator(points, values, fill_value=fill_value,
--> 217                                   rescale=rescale)
    218         return ip(xi)
    219     elif method == 'cubic' and ndim == 2:

scipy/interpolate/interpnd.pyx in scipy.interpolate.interpnd.LinearNDInterpolator.__init__ (scipy/interpolate/interpnd.c:4934)()

scipy/interpolate/interpnd.pyx in scipy.interpolate.interpnd.NDInterpolatorBase.__init__ (scipy/interpolate/interpnd.c:2386)()

scipy/interpolate/interpnd.pyx in scipy.interpolate.interpnd._check_init_shape (scipy/interpolate/interpnd.c:4556)()

ValueError: different number of values and points

Which I understand comes from the fact that I have more cluster centers than sets were used to build the MSM. Do I just have to wait until the active_count_fraction of my estimated MSM is 1 to carry on with my MSM analysis?

As a side note, I've also tried using as metric the contacts between the alpha carbons of the protein and my ligand but the tICA projections look very different from what I get with the distance metric. I'm thinking it might be related to the fact that the covariance matrix is sparse but I'm not sure. (this is off topic from my question but if someone has encountered this before I would appreciate some input 😃)
tica2

Thanks for putting out the tutorials, they help out quite a lot with understanding how the package works.

Hi, you can use M.active_set to get the indexes of states in the active
set (in this case largest connected set). With that you can stride the
clusters you want to visualize. We are thinking about doing this in a
more convenient way, but this is how it workes at the moment.

Am 12/09/16 um 17:17 schrieb Juan Eiros:

Hi everyone,

I'm following the BPTI tutorial
http://www.emma-project.org/latest/generated/MSM_BPTI.html with a
system I'm studying (a ligand binding to a protein). I'm just testing
around and learning how to use your software, so for the moment I'm
just proceeding forward with the MSM building workflow without doing
much hyper parameter tuning. My dataset in total is 900 ns (45000
frames) for a 3800 atom system (after getting rid of solvent).

So, I am using a distance featurizer between the ligand and the alpha
carbons of the protein, a tICA reduction to 5 dimensions and a
clustering of a 1000 microstates with kmeans. I decided to build an
MSM with a lag time of 8 ns (400 steps) after checking the timescale
plots:
its
https://cloud.githubusercontent.com/assets/10696140/18440643/e2813a04-7901-11e6-8e0c-a509b765e929.png

The problem I'm facing is that the MSM is not fully connected, and is
only using 944 out of the 1000 clusters. When I try to reproduce the
figure for how the 2nd slowest right eigenvector looks like, I get the
following error:

cl= pyemma.coordinates.cluster_kmeans(data=Y,k=1000,stride=1)
M = msm.estimate_markov_model(cl.dtrajs,lag=400)

cc_x= cl.clustercenters[:,0]
cc_y= cl.clustercenters[:,1]
r2= M.eigenvectors_right()[:,1]
r2= M.eigenvectors_right()[:,1]

ax= mplt.scatter_contour(cc_x, cc_y, r2)
ValueError Traceback (most recent call last)
in () 1 r2 =
M.eigenvectors_right()[:,1] 2 r2 = M.eigenvectors_right()[:,1] ----> 3
ax = mplt.scatter_contour(cc_x, cc_y, r2)
/Users/je714/anaconda3/lib/python3.5/site-packages/pyEMMA-2.2.3-py3.5-macosx-10.6-x86_64.egg/pyemma/plots/plots2d.py
in scatter_contour(x, y, z, ncontours, colorbar, fig, ax, cmap,
outfile) 82 """ 83 import matplotlib.pylab as _plt ---> 84 ax =
contour(x, y, z, ncontours=ncontours, colorbar=colorbar, fig=fig,
ax=ax, cmap=cmap) 85 86 # scatter points
/Users/je714/anaconda3/lib/python3.5/site-packages/pyEMMA-2.2.3-py3.5-macosx-10.6-x86_64.egg/pyemma/plots/plots2d.py
in contour(x, y, z, ncontours, colorbar, fig, ax, method, zlim, cmap)
37 points = _np.hstack([x[:,None],y[:,None]]) 38 xi, yi =
_np.mgrid[x.min():x.max():100j, y.min():y.max():100j] ---> 39 zi =
gd(points, z, (xi, yi), method=method) 40 # contour level levels 41 if
zlim is None:
/Users/je714/anaconda3/lib/python3.5/site-packages/scipy/interpolate/ndgriddata.py
in griddata(points, values, xi, method, fill_value, rescale) 215 elif
method == 'linear': 216 ip = LinearNDInterpolator(points, values,
fill_value=fill_value, --> 217 rescale=rescale) 218 return ip(xi) 219
elif method == 'cubic' and ndim == 2: scipy/interpolate/interpnd.pyx
in scipy.interpolate.interpnd.LinearNDInterpolator.init
(scipy/interpolate/interpnd.c:4934)() scipy/interpolate/interpnd.pyx
in scipy.interpolate.interpnd.NDInterpolatorBase.init
(scipy/interpolate/interpnd.c:2386)() scipy/interpolate/interpnd.pyx
in scipy.interpolate.interpnd._check_init_shape
(scipy/interpolate/interpnd.c:4556)() ValueError: different number of
values and points

Which I understand comes from the fact that I have more cluster
centers than sets were used to build the MSM. Do I just have to wait
until the |active_count_fraction| of my estimated MSM is 1 to carry on
with my MSM analysis?

As a side note, I've also tried using as metric the contacts between
the alpha carbons of the protein and my ligand but the tICA
projections look very different from what I get with the distance
metric. I'm thinking it might be related to the fact that the
covariance matrix is sparse but I'm not sure. (this is off topic from
my question but if someone has encountered this before I would
appreciate some input 😃)
tica2
https://cloud.githubusercontent.com/assets/10696140/18441275/557743ee-7904-11e6-97da-834ec896d875.png

Thanks for putting out the tutorials, they help out quite a lot with
understanding how the package works.


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#928, or mute the thread
https://github.com/notifications/unsubscribe-auth/AGMeQiKRtoHWMAULXammOptLlzLFFYNwks5qpW0agaJpZM4J6s1h.


Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin

Phone: (+49) (0)30 838 75354
Web: research.franknoe.de

Mail: Arnimallee 6, 14195 Berlin, Germany