How to proceed with disconnected MSM
Closed this issue · 1 comments
Hi everyone,
I'm following the BPTI tutorial with a system I'm studying (a ligand binding to a protein). I'm just testing around and learning how to use your software, so for the moment I'm just proceeding forward with the MSM building workflow without doing much hyper parameter tuning. My dataset in total is 900 ns (45000 frames) for a 3800 atom system (after getting rid of solvent).
So, I am using a distance featurizer between the ligand and the alpha carbons of the protein, a tICA reduction to 5 dimensions and a clustering of a 1000 microstates with kmeans. I decided to build an MSM with a lag time of 8 ns (400 steps) after checking the timescale plots:
The problem I'm facing is that the MSM is not fully connected, and is only using 944 out of the 1000 clusters. When I try to reproduce the figure for how the 2nd slowest right eigenvector looks like, I get the following error:
cl = pyemma.coordinates.cluster_kmeans(data=Y, k=1000, stride=1)
M = msm.estimate_markov_model(cl.dtrajs, lag=400)
cc_x = cl.clustercenters[:,0]
cc_y = cl.clustercenters[:,1]
r2 = M.eigenvectors_right()[:,1]
r2 = M.eigenvectors_right()[:,1]
ax = mplt.scatter_contour(cc_x, cc_y, r2)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-78-8f489d57713b> in <module>()
1 r2 = M.eigenvectors_right()[:,1]
2 r2 = M.eigenvectors_right()[:,1]
----> 3 ax = mplt.scatter_contour(cc_x, cc_y, r2)
/Users/je714/anaconda3/lib/python3.5/site-packages/pyEMMA-2.2.3-py3.5-macosx-10.6-x86_64.egg/pyemma/plots/plots2d.py in scatter_contour(x, y, z, ncontours, colorbar, fig, ax, cmap, outfile)
82 """
83 import matplotlib.pylab as _plt
---> 84 ax = contour(x, y, z, ncontours=ncontours, colorbar=colorbar, fig=fig, ax=ax, cmap=cmap)
85
86 # scatter points
/Users/je714/anaconda3/lib/python3.5/site-packages/pyEMMA-2.2.3-py3.5-macosx-10.6-x86_64.egg/pyemma/plots/plots2d.py in contour(x, y, z, ncontours, colorbar, fig, ax, method, zlim, cmap)
37 points = _np.hstack([x[:,None],y[:,None]])
38 xi, yi = _np.mgrid[x.min():x.max():100j, y.min():y.max():100j]
---> 39 zi = gd(points, z, (xi, yi), method=method)
40 # contour level levels
41 if zlim is None:
/Users/je714/anaconda3/lib/python3.5/site-packages/scipy/interpolate/ndgriddata.py in griddata(points, values, xi, method, fill_value, rescale)
215 elif method == 'linear':
216 ip = LinearNDInterpolator(points, values, fill_value=fill_value,
--> 217 rescale=rescale)
218 return ip(xi)
219 elif method == 'cubic' and ndim == 2:
scipy/interpolate/interpnd.pyx in scipy.interpolate.interpnd.LinearNDInterpolator.__init__ (scipy/interpolate/interpnd.c:4934)()
scipy/interpolate/interpnd.pyx in scipy.interpolate.interpnd.NDInterpolatorBase.__init__ (scipy/interpolate/interpnd.c:2386)()
scipy/interpolate/interpnd.pyx in scipy.interpolate.interpnd._check_init_shape (scipy/interpolate/interpnd.c:4556)()
ValueError: different number of values and points
Which I understand comes from the fact that I have more cluster centers than sets were used to build the MSM. Do I just have to wait until the active_count_fraction
of my estimated MSM is 1 to carry on with my MSM analysis?
As a side note, I've also tried using as metric the contacts between the alpha carbons of the protein and my ligand but the tICA projections look very different from what I get with the distance metric. I'm thinking it might be related to the fact that the covariance matrix is sparse but I'm not sure. (this is off topic from my question but if someone has encountered this before I would appreciate some input 😃)
Thanks for putting out the tutorials, they help out quite a lot with understanding how the package works.
Hi, you can use M.active_set to get the indexes of states in the active
set (in this case largest connected set). With that you can stride the
clusters you want to visualize. We are thinking about doing this in a
more convenient way, but this is how it workes at the moment.
Am 12/09/16 um 17:17 schrieb Juan Eiros:
Hi everyone,
I'm following the BPTI tutorial
http://www.emma-project.org/latest/generated/MSM_BPTI.html with a
system I'm studying (a ligand binding to a protein). I'm just testing
around and learning how to use your software, so for the moment I'm
just proceeding forward with the MSM building workflow without doing
much hyper parameter tuning. My dataset in total is 900 ns (45000
frames) for a 3800 atom system (after getting rid of solvent).So, I am using a distance featurizer between the ligand and the alpha
carbons of the protein, a tICA reduction to 5 dimensions and a
clustering of a 1000 microstates with kmeans. I decided to build an
MSM with a lag time of 8 ns (400 steps) after checking the timescale
plots:
its
https://cloud.githubusercontent.com/assets/10696140/18440643/e2813a04-7901-11e6-8e0c-a509b765e929.pngThe problem I'm facing is that the MSM is not fully connected, and is
only using 944 out of the 1000 clusters. When I try to reproduce the
figure for how the 2nd slowest right eigenvector looks like, I get the
following error:cl= pyemma.coordinates.cluster_kmeans(data=Y,k=1000,stride=1)
M = msm.estimate_markov_model(cl.dtrajs,lag=400)cc_x= cl.clustercenters[:,0]
cc_y= cl.clustercenters[:,1]
r2= M.eigenvectors_right()[:,1]
r2= M.eigenvectors_right()[:,1]
ax= mplt.scatter_contour(cc_x, cc_y, r2) ValueError Traceback (most recent call last) in () 1 r2 = M.eigenvectors_right()[:,1] 2 r2 = M.eigenvectors_right()[:,1] ----> 3 ax = mplt.scatter_contour(cc_x, cc_y, r2) /Users/je714/anaconda3/lib/python3.5/site-packages/pyEMMA-2.2.3-py3.5-macosx-10.6-x86_64.egg/pyemma/plots/plots2d.py in scatter_contour(x, y, z, ncontours, colorbar, fig, ax, cmap, outfile) 82 """ 83 import matplotlib.pylab as _plt ---> 84 ax = contour(x, y, z, ncontours=ncontours, colorbar=colorbar, fig=fig, ax=ax, cmap=cmap) 85 86 # scatter points /Users/je714/anaconda3/lib/python3.5/site-packages/pyEMMA-2.2.3-py3.5-macosx-10.6-x86_64.egg/pyemma/plots/plots2d.py in contour(x, y, z, ncontours, colorbar, fig, ax, method, zlim, cmap) 37 points = _np.hstack([x[:,None],y[:,None]]) 38 xi, yi = _np.mgrid[x.min():x.max():100j, y.min():y.max():100j] ---> 39 zi = gd(points, z, (xi, yi), method=method) 40 # contour level levels 41 if zlim is None: /Users/je714/anaconda3/lib/python3.5/site-packages/scipy/interpolate/ndgriddata.py in griddata(points, values, xi, method, fill_value, rescale) 215 elif method == 'linear': 216 ip = LinearNDInterpolator(points, values, fill_value=fill_value, --> 217 rescale=rescale) 218 return ip(xi) 219 elif method == 'cubic' and ndim == 2: scipy/interpolate/interpnd.pyx in scipy.interpolate.interpnd.LinearNDInterpolator.init (scipy/interpolate/interpnd.c:4934)() scipy/interpolate/interpnd.pyx in scipy.interpolate.interpnd.NDInterpolatorBase.init (scipy/interpolate/interpnd.c:2386)() scipy/interpolate/interpnd.pyx in scipy.interpolate.interpnd._check_init_shape (scipy/interpolate/interpnd.c:4556)() ValueError: different number of values and points Which I understand comes from the fact that I have more cluster
centers than sets were used to build the MSM. Do I just have to wait
until the |active_count_fraction| of my estimated MSM is 1 to carry on
with my MSM analysis?As a side note, I've also tried using as metric the contacts between
the alpha carbons of the protein and my ligand but the tICA
projections look very different from what I get with the distance
metric. I'm thinking it might be related to the fact that the
covariance matrix is sparse but I'm not sure. (this is off topic from
my question but if someone has encountered this before I would
appreciate some input 😃)
tica2
https://cloud.githubusercontent.com/assets/10696140/18441275/557743ee-7904-11e6-97da-834ec896d875.pngThanks for putting out the tutorials, they help out quite a lot with
understanding how the package works.—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#928, or mute the thread
https://github.com/notifications/unsubscribe-auth/AGMeQiKRtoHWMAULXammOptLlzLFFYNwks5qpW0agaJpZM4J6s1h.
Prof. Dr. Frank Noe
Head of Computational Molecular Biology group
Freie Universitaet Berlin
Phone: (+49) (0)30 838 75354
Web: research.franknoe.de