AG-Peter/encodermap

ValueError: The provided value for n_atoms (808) does not match the shape of the coordinate array (979)

Closed this issue · 12 comments

b8_dihedral_to_cartesian_diubi_analysis.py:74: RuntimeWarning: divide by zero encountered in log caxe = axe1.imshow(-np.log(hist.T), origin='low', extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]], aspect="auto") Traceback (most recent call last): File "/home/danielem/.local/lib/python3.6/site-packages/matplotlib/cbook/init.py", line 216, in process func(*args, **kwargs) File "/home/danielem/encodermap/encodermap/plot.py", line 80, in _on_key self.use_points(points) File "/home/danielem/encodermap/encodermap/plot.py", line 293, in use_points align_reference=self.align_reference, align_select=self.align_select) File "/home/danielem/encodermap/encodermap/moldata.py", line 209, in write output_universe.load_new(coordinates, format=MemoryReader) File "/home/danielem/.local/lib/python3.6/site-packages/MDAnalysis/core/universe.py", line 604, in load_new self.trajectory = reader(filename, **kwargs) File "/home/danielem/.local/lib/python3.6/site-packages/MDAnalysis/coordinates/memory.py", line 370, in init .format(provided_n_atoms, self.n_atoms)) ValueError: The provided value for n_atoms (808) does not match the shape of the coordinate array (979)

Hard to tell what is going wrong exactly without any further information. Somehow the number of coordinates generated does not match the number of atoms in your moldata object. Is it possible that you mixed up one EncoderMap run with a moldata object from a different molecule or a different selection of atoms?

Sorry for the cryptic message I've sent before, now I will try to explain myself better.

"Is it possible that you mixed up one EncoderMap run with a moldata object from a different molecule or a different selection of atoms?"
I don't think so, because I've been using the same molecule and selections from the very beginning.

This are the two modified files that I've used.


import MDAnalysis as md
import os
import numpy as np
import encodermap as em
import tensorflow as tf
import matplotlib.pyplot as plt
import locale
import copy

molname = "it_tremd"
structure_path = "it_tremd.pdb"
trajectory_paths = ["wt_tremd.xtc"]

uni = md.Universe(structure_path, trajectory_paths)
selected_atoms = uni.select_atoms("backbone or name H or name O1 or (name CD and resname PRO)")
moldata = em.MolData(selected_atoms)

######################### Define parameters ##################################

total_steps = 50000
parameters = em.ADCParameters()
parameters.main_path = em.misc.run_path("runs/{}".format(molname))

parameters.cartesian_cost_scale = 0
parameters.cartesian_cost_variant = "mean_abs"
parameters.cartesian_cost_scale_soft_start = (int(total_steps/109), int(total_steps/109)+1000)
parameters.cartesian_pwd_start = 1 # Calculate pairwise distances starting form the second backbone atom ...
parameters.cartesian_pwd_step = 3 # for every third atom. These are the C_alpha atoms

parameters.dihedral_cost_scale = 1
parameters.dihedral_cost_variant = "mean_abs"

parameters.distance_cost_scale = 0 # no distance cost in dihedral space
parameters.cartesian_distance_cost_scale = 100 # instead we use distance cost in C_alpha distance space
parameters.cartesian_dist_sig_parameters = [500, 10, 5, 1, 2, 5]

parameters.checkpoint_step = max(1, int(total_steps/10))
parameters.l2_reg_constant = 0.001
parameters.center_cost_scale = 0
parameters.id = molname

########### Check how distances are mapped with your choice of Sigmoid parameters ##############

pwd = em.misc.pairwise_dist(
moldata.central_cartesians[::100, parameters.cartesian_pwd_start::parameters.cartesian_pwd_step], flat=True)
with tf.Session() as sess:
pwd = sess.run(pwd)
axes = em.plot.distance_histogram(pwd, float("inf"), parameters.cartesian_dist_sig_parameters)
plt.show()

#somehow matplotlib messes with this setting and causes problems in tensorflow
locale.setlocale(locale.LC_NUMERIC, "en_US.UTF-8")

########################## Get references from dummy model ##################################

dummy_parameters = copy.deepcopy(parameters)
dummy_parameters.main_path = em.misc.create_dir(os.path.join(parameters.main_path, "dummy"))
dummy_parameters.n_steps = int(len(moldata.dihedrals) / parameters.batch_size)
dummy_parameters.summary_step = 1

e_map = em.AngleDihedralCartesianEncoderMapDummy(dummy_parameters, moldata)
e_map.train()
e_map.close()
e_map = None

costs = em.misc.read_from_log(os.path.join(dummy_parameters.main_path, "train"),
["cost/angle_cost", "cost/dihedral_cost", "cost/cartesian_cost"])
means = [np.mean(cost[:, 2]) for cost in costs]
parameters.angle_cost_reference = means[0]
parameters.dihedral_cost_reference = means[1]
parameters.cartesian_cost_reference = means[2]

np.savetxt(os.path.join(dummy_parameters.main_path, "adc_cost_means.txt"), np.array(means))

########################## Run Training #########################

#First run without C_alpha cost
parameters.n_steps = parameters.cartesian_cost_scale_soft_start[0]
e_map = em.AngleDihedralCartesianEncoderMap(parameters, moldata)
e_map.train()
e_map.close()
e_map = None

#Now we turn on C_alpha cost and continue the training run
parameters.n_steps = total_steps - parameters.cartesian_cost_scale_soft_start[0]
parameters.cartesian_cost_scale = 1
ckpt_path = os.path.join(parameters.main_path, "checkpoints", "step{}.ckpt"
.format(parameters.cartesian_cost_scale_soft_start[0]))

e_map = em.AngleDihedralCartesianEncoderMap(parameters, moldata, checkpoint_path=ckpt_path)
e_map.train()
e_map.close()
e_map = None


Analysis

import encodermap as em
import os
import numpy as np
import matplotlib.pyplot as plt
import MDAnalysis as md
import time
import sys

molname = "it_tremd"
run_id = 3
step = 50000
selection_for_alignment = "all"

main_path = "runs/{}/run{}".format(molname, run_id)

########################## Load data #########################

structure_path = "it_tremd.pdb"
trajectory_paths = ["wt_tremd.xtc"]

uni = md.Universe(structure_path, trajectory_paths)
selected_atoms = uni.select_atoms("backbone or name H or name O1 or (name CD and resname PRO)")
moldata = em.MolData(selected_atoms)

########################## Load parameters and checkpoint #########################

parameters = em.ADCParameters.load(os.path.join(main_path, "parameters.json"))

e_map = em.AngleDihedralCartesianEncoderMap(parameters, moldata,
checkpoint_path=os.path.join(main_path, "checkpoints",
"step{}.ckpt".format(step)), read_only=True)

########################## Project Data to map #########################

projected = e_map.encode(moldata.dihedrals)

########################## Plot histogram with path generator and lasso Select #########################

hist, xedges, yedges = np.histogram2d(projected[:, 0], projected[:, 1], bins=100)

#GENERATOR
fig1, axe1 = plt.subplots()

caxe = axe1.imshow(-np.log(hist.T), origin='low', extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]], aspect="auto")
cbar = fig1.colorbar(caxe)
cbar.set_label("-ln(p)", labelpad=0)
axe1.set_title("Path Generator")

#Interactive part
generator = em.plot.PathGenerateCartesians(axe1, e_map, moldata,
align_reference=moldata.sorted_atoms,
align_select=moldata.sorted_atoms)
#SELECTOR
fig2, axe2 = plt.subplots()

caxe = axe2.imshow(-np.log(hist.T), origin='low', extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]], aspect="auto")
cbar = fig2.colorbar(caxe)
cbar.set_label("-ln(p)", labelpad=0)
axe2.set_title("Selector")

#Interactive part
selector = em.plot.PathSelect(axe2, projected, moldata, e_map.p.main_path,
align_reference=moldata.sorted_atoms, align_select=sorted_atoms)

plt.show()


I've observed that the ValueError is raised when Encodermap has to generate the dihedrals for points in the histogram2D that don't correspond to points visited during the training trajectories (white points for instance).

With PathGenerator there are two cases:
1- Some points of the interpolation line falls into the snowy no man's land (again, white points of the histogram2D). This include the case where both initial and starting point of the line are located on visited conformations.
2- All points of the interpolation line when the line are located in the coloured landscape.

In both cases I only get a single generated.pdb file with only one structure (808 atoms) without the generated.xtc trajectory.
However, while for case 1 the "generated_dihedrals.npy", "generated_cartesians.npy" and "points.npy" files are empty
numpy.load("generated_cartesians.npy")
array([], shape=(0, 979, 3), dtype=float32)

numpy.load("generated_dihedrals.npy")
array([], shape=(0, 585), dtype=float32)

numpy.load("points.npy")
array([], shape=(0, 2), dtype=float64)

For case 2 the "generated_dihedrals.npy", "generated_cartesians.npy" and "points.npy" files are not empty
np.load("generated_cartesians.npy")
array([[[ 340.40887 , -156.71812 , 14.812061 ],
[ 339.13358 , -156.92285 , 15.56133 ],
[ 338.00388 , -156.04292 , 15.043763 ],
...,
[ 296.352 , -151.76828 , 10.23206 ],
[ 295.23303 , -152.34877 , 9.340768 ],
[ 294.29486 , -152.99358 , 9.83233 ]],

   [[ 340.4265   , -156.65323  ,   14.80023  ],
    [ 339.1524   , -156.86037  ,   15.5506735],
    [ 338.02197  , -155.98006  ,   15.035236 ],
    ...,
    [ 296.31827  , -151.76434  ,   10.181342 ],
    [ 295.2036   , -152.34918  ,    9.287112 ],
    [ 294.26462  , -152.99431  ,    9.776725 ]],

   [[ 340.4061   , -156.74065  ,   14.806906 ],
    [ 339.13022  , -156.94724  ,   15.554622 ],
    [ 338.00208  , -156.0638   ,   15.039015 ],
    ...,
    [ 296.3589   , -151.80229  ,   10.155857 ],
    [ 295.246    , -152.38972  ,    9.26109  ],
    [ 294.30872  , -153.03772  ,    9.750174 ]],

   ...,

   [[ 339.42545  , -157.90527  ,   13.330541 ],
    [ 338.18195  , -158.16643  ,   14.114656 ],
    [ 337.09344  , -157.13637  ,   13.843624 ],
    ...,
    [ 297.84274  , -153.943    ,    8.762965 ],
    [ 296.99573  , -154.91956  ,    7.918442 ],
    [ 296.14227  , -155.64691  ,    8.447772 ]],

   [[ 339.45627  , -157.77109  ,   13.277283 ],
    [ 338.21478  , -158.03618  ,   14.06283  ],
    [ 337.12326  , -157.00899  ,   13.79445  ],
    ...,
    [ 297.8537   , -153.892    ,    8.834031 ],
    [ 297.005    , -154.87134  ,    7.9943395],
    [ 296.1506   , -155.5949   ,    8.527327 ]],

   [[ 339.44815  , -157.74577  ,   13.336235 ],
    [ 338.2058   , -158.01183  ,   14.119926 ],
    [ 337.11487  , -156.98352  ,   13.851964 ],
    ...,
    [ 297.88385  , -153.947    ,    8.798447 ],
    [ 297.0368   , -154.92967  ,    7.9618187],
    [ 296.18332  , -155.6528   ,    8.4968605]]], dtype=float32)

np.load("generated_dihedrals.npy")
array([[ 2.9005954, 3.1264963, -1.4093626, ..., 2.534475 , 3.133247 ,
-1.819616 ],
[ 2.9006398, 3.126515 , -1.4091741, ..., 2.5345163, 3.1332383,
-1.8195987],
[ 2.9006844, 3.1265335, -1.4089855, ..., 2.534557 , 3.1332295,
-1.8195815],
...,
[ 2.9099605, 3.1298342, -1.3737955, ..., 2.534987 , 3.1308818,
-1.815484 ],
[ 2.910013 , 3.1298494, -1.3736246, ..., 2.5349364, 3.1308665,
-1.8154594],
[ 2.9100652, 3.1298647, -1.3734541, ..., 2.534885 , 3.130851 ,
-1.8154347]], dtype=float32)

np.load("points.npy")
array([[-0.96987992, 0.67838148],
[-0.9692479 , 0.67774558],
[-0.96861588, 0.67710968],
[-0.96798386, 0.67647378],
[-0.96735184, 0.67583789],
[-0.96671982, 0.67520199],
[-0.9660878 , 0.67456609],
[-0.96545578, 0.67393019],
[-0.96482376, 0.67329429],
[-0.96419174, 0.67265839],
[-0.96355972, 0.67202249],
[-0.9629277 , 0.67138659],
[-0.96229568, 0.67075069],
[-0.96166366, 0.6701148 ],
[-0.96103163, 0.6694789 ],
[-0.96039961, 0.668843 ],
[-0.95976759, 0.6682071 ],
[-0.95913557, 0.6675712 ],
[-0.95850355, 0.6669353 ],
[-0.95787153, 0.6662994 ],
[-0.95723951, 0.6656635 ],
...,
[-0.84916405, 0.5569248 ],
[-0.84853203, 0.5562889 ],
[-0.84790001, 0.55565301],
[-0.84726799, 0.55501711],
[-0.84663597, 0.55438121],
[-0.84600395, 0.55374531],
[-0.84537193, 0.55310941],
[-0.84473991, 0.55247351],
[-0.84410789, 0.55183761]])

With PathSelect instead I obtain both the generated.pdb and generated.xtc files and the ValueError is not raised. I think that this is because PathSelect only selects the points within the chosen area that are visited during the trajectory.
indices = np.nonzero(Path(points).contains_points(self.projected))[0]
self.axe.scatter(self.projected[indices, 0], self.projected[indices, 1])

In fact, if I try to modify the function by trying to select all points within the area, I get the error "ValueError: The provided value for n_atoms (808) does not match the shape of the coordinate array (1102)".

I think that there is a problem with the coordinate array when
angles, dihedrals, cartesians = self.autoencoder.generate(points)
is called.

What can I do to fix it?

I'm not sure what's wrong. Maybe (if you have not already) you could run my original example files with the Diubiquitin data. (you can get the trajectories from https://www.kaggle.com/andrejberg/md-simulations-of-linear-ubiquitin-dimers)

That way we could find out whether there is a problem in your installation (e.g. some combination of tenserflow, encodermap and numpy versions that don't work together) or whether it is something related to the changes you made or your data.

I had the same idea, so yesterday I've run the original diubiquitin data with the same scripts that I've posted before and it hasn't caused any problem. I've obtained the same map that you published in the paper and I was able to correctly use Path Generator and get the trajectory file.
At this point, I think that the issue could be related to the simulation data that I have used to train Encodermap.

Hello @TobiasLe. What is the code that you used to colour differently the two distinct simulations in Fig.10 of your paper on Encodermap(II) (traces of two Ssa1 simulations in the two-dimensional EncoderMap)?

Basic numpy slicing will do the trick. Something like projected[0 : len_traj0] will get you the data of the first trajectory, and I'm sure you can figure out how you need to set the indices for the other trajectories.

Hello again @TobiasLe! I'm still facing the ValueError problem. I've tried many different approaches, and none of them has solved the issue.

Again, the error happens when the decoder tries to backmap the points I've selected on the 2D map into proteins.
I think that the sequence of events goes like this:

  1. I click on the 2D map to select the point(s) and I press Enter.
  2. The function use_points is called within the PathGenerateCartesians class.
  3. The decoder generates the cartesians angles, dihedrals, cartesians = self.autoencoder.generate(points)
  4. The generated cartesians are fed into the write function of MolData: output_universe.load_new(coordinates, format=MemoryReader)
  5. ValueError: The provided value for n_atoms (808) does not match the shape of the coordinate array (979)

I've observed that the dimension of the cartesian ndarray that is fed to write has dimensions (200, 979, 3) instead of (200, 808, 3). During the whole process I've checked that the number of atoms saved in the MolData object moldata is 808.
The program writes the generated_dihedrals.npy, generated_cartesians.npy, points.npy and generated.pdb files.
The generated.pdb file only contains one structure, equal to the structure used as input for the training. points.npy has shape (200,2).
If I check the generated_cartesians.npy file I see that there are three columns, but there are coordinates for 979 atoms instead of 808.
The number of generated dihedrals in generated_dihedrals.npy is instead equal to the number of dihedrals saved in moldata (585).

I cannot understand where this discrepancy stems from: the number of atoms in dihedrals is maintained, otherwise there won't be 585 dihedrals both in the input and the output, but the number of atoms in the cartesian coordinates is somewhat different.

I've tried to use different variations of the input trajectory as input for the training, to check what could be the problem:

  • Simulation A of protein 1 (xtc)
  • Simulation B of protein 1 (xtc)
  • Multiple simulations (A+B) of protein 1 (xtc)
  • Simulation A of protein 1 (pdb)
  • Simulation A of protein 1 (xtc) without centering the trajectory (removing PBCs)
  • Simulation A of protein 1 (xtc) without centering the trajectory and without removing PBCs
  • Simulation of a different protein (8150 atoms)

I've also tried different values of sigma in cartesian_dist_sig_parameters in order to match the mapping in the diubiquitin example.

  • Is it possible that Encodermap has issues when dealing with a number of atoms greater than a certain threshold (ex. 1500 for your diubiquitin test)? The protein that I want to use Encodermap for has 2992 atoms.
  • Is it possible that there should be a minimum number of trajectory frames that should be used to be able to correctly generate new conformations with the decoder part of Encodermap? The diubiquitine trajectories had altogether 60.000 frames for example. All the trajectories I've used for testing had around 5000 frames.
  • What kind of post-processing did the diubiquitin trajectories received before being submitted to the Encodermap training?
  • What can I try to solve this issue?

Thank you again.

@Monte95 I'm sorry you are experiencing this kind of problems. This part of the code has not been used by lots of people.

I think I figured out what is causing the problem. EncoderMap generates atom coordinates for 979 atoms but the mdanalysis universe of moldata (used to write the trajectory) only contains 808 atoms.
=> there is 171 atoms missing.
That is equal to your number of residues (196) minus the number of prolines in the sequence (25).

Long story short: In contrast to the Diubiquitin example, your files probably do not contain hydrogen atoms, but my code generates the coordinates including the hydrogen atoms of the backbone N-atoms.

selected_atoms = uni.select_atoms("backbone or name H or name O1 or (name CD and resname PRO)")

This was supposed to also select these hydrogen atoms but this doesn't work if they are not present in the loaded file.
Can you try with a file that includes these hydrogen atoms? Make sure that the uni.select_atoms command above results in the 979 atoms including the H-atoms at the nitrogens.

I will surely try it right away.
Question: if I simply use selected_atoms = uni.select_atoms("backbone or name O1 or (name CD and resname PRO)"), without the H atoms at the hydrogen, does encodermap generate them anyway?

Yes they will be generated anyway. I should add some checks there.

Finally it worked!!!!

I've used two different approaches, and they both work:

  1. I've changed the selection to match the atom names in my structure pdb file: selected_atoms = uni.select_atoms("backbone or name HN name OT1 or (name CD and resname PRO)"). It generates the generated.pdb and generated.xtc files, but the protein is somewhat unfolded.
  2. I've changed the atom names in the input files from HN to H and OT1 to O1 to match the atom names used in the standard command selected_atoms = uni.select_atoms("backbone or name H name O1 or (name CD and resname PRO)") and it works just fine. I will stick with this approach.

Thank you a lot for your help and your time @TobiasLe!

@Monte95 Great! You're welcome!