"atom_types" not included in pytest and pytest not pass
Huni-ML opened this issue · 7 comments
Describe the bug
There is a bug in tests/conftest.py: the typemapper is not set for atomic_batch, which causes atom_types to be missing from the data.
Also, pytest throws several failures, shown below. It seems that the numeric_gradient, partial_forces, and normalized_basis tests do not pass.
Meanwhile, the TorchScript interpreter fails for float64.
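For context, here is a minimal sketch of the kind of wiring that would populate atom_types in test data. The calls (AtomicData.from_points, TypeMapper with chemical_symbol_to_type) reflect my understanding of nequip 0.5.x, and the two-atom example is purely illustrative, not the actual conftest.py fix:

import torch
from nequip.data import AtomicData
from nequip.data.transforms import TypeMapper

# Build a tiny two-atom (H, C) example, then map atomic numbers onto
# contiguous type indices so that `atom_types` is present in the data.
data = AtomicData.from_points(
    pos=torch.tensor([[0.0, 0.0, 0.0], [0.0, 0.0, 1.1]]),
    r_max=4.0,
    atomic_numbers=torch.tensor([1, 6]),
)
type_mapper = TypeMapper(chemical_symbol_to_type={"H": 0, "C": 1})
data = type_mapper(data)  # adds `atom_types` derived from `atomic_numbers`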
Reproduce
Just run pytest:
FAILED tests/model/test_allegro.py::TestGradient::test_numeric_gradient[float32-config0-device0] - AssertionError: assert (tensor(False, device='cuda:0') or tensor(False, device='cuda:0'))
FAILED tests/model/test_allegro.py::TestGradient::test_numeric_gradient[float32-config1-device0] - AssertionError: assert (tensor(False, device='cuda:0') or tensor(False, device='cuda:0'))
FAILED tests/model/test_allegro.py::TestGradient::test_numeric_gradient[float32-config2-device0] - AssertionError: assert (tensor(False, device='cuda:0') or tensor(False, device='cuda:0'))
FAILED tests/model/test_allegro.py::TestGradient::test_numeric_gradient[float32-config3-device0] - AssertionError: assert (tensor(False, device='cuda:0') or tensor(False, device='cuda:0'))
FAILED tests/model/test_allegro.py::TestGradient::test_partial_forces[float32-device0] - AssertionError: assert False
FAILED tests/nn/test_norm_basis.py::test_normalized_basis[float32-0.2] - assert tensor(False)
FAILED tests/nn/test_norm_basis.py::test_normalized_basis[float32-1.0] - assert tensor(False)
FAILED tests/model/test_allegro.py::TestWorkflow::test_weight_init[float64-device0-config0-model0] - RuntimeError: The following operation failed in the TorchScript interpreter.
FAILED tests/model/test_allegro.py::TestWorkflow::test_weight_init[float64-device0-config0-model1] - RuntimeError: The following operation failed in the TorchScript interpreter.
FAILED tests/model/test_allegro.py::TestWorkflow::test_jit[float64-device0-config0-model0] - RuntimeError: The following operation failed in the TorchScript interpreter.
FAILED tests/model/test_allegro.py::TestWorkflow::test_jit[float64-device0-config0-model1] - RuntimeError: The following operation failed in the TorchScript interpreter.
....
Environment (please complete the following information):
OS: CentOS 7
python version: 3.9.12
python environment (commands are given for python interpreter):
nequip version: 0.5.5
e3nn version: 0.5.0
pytorch version: 1.11.0+cu113
Hi @yuhongyu0721 ,
Thanks very much for pointing out this issue, and for your interest in our code!
I believe I've resolved this on the develop branch, and the test suite should now run correctly. Please note that it also requires the latest develop branch of the nequip repository. Can you pull down the latest commits on both develop branches and let me know if the tests succeed for you now? Thanks!
Thanks a lot for your reply, and for your great contributions to this field!
I have updated nequip and allegro to their latest develop branches. However, several errors are thrown:
ERROR tests/model/test_allegro.py::TestAllegro::test_numeric_gradient[float64-device0-config0-base_config2]
E RuntimeError: Failed to build object with prefix `edge_eng` using builder `ScalarMLP`
The initialization of edge_eng (ScalarMLP) goes wrong, even though I see edge_eng_mlp_latent_dimensions in the config. This is confusing to me, since edge_eng should accept edge_eng_mlp_latent_dimensions as its mlp_latent_dimensions input.
Do the tests pass in your environment? Thanks!
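To make the expectation concrete, here is a minimal, hypothetical sketch of the prefix-based keyword resolution described above; resolve_prefixed_kwargs is an invented helper for illustration only, not nequip's actual instantiate machinery:

def resolve_prefixed_kwargs(config: dict, prefix: str, param_names: set) -> dict:
    """Collect `<prefix>_<param>` config keys and strip the prefix.

    E.g. with prefix="edge_eng", the key "edge_eng_mlp_latent_dimensions"
    resolves to the builder argument mlp_latent_dimensions.
    """
    resolved = {}
    for key, value in config.items():
        if key.startswith(prefix + "_"):
            param = key[len(prefix) + 1:]
            if param in param_names:
                resolved[param] = value
    return resolved


# The expectation: the ScalarMLP builder named `edge_eng` should receive
# mlp_latent_dimensions from the edge_eng_mlp_latent_dimensions config key.
config = {"edge_eng_mlp_latent_dimensions": [32]}
print(resolve_prefixed_kwargs(config, "edge_eng", {"mlp_latent_dimensions"}))
# -> {'mlp_latent_dimensions': [32]}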
Hm, the tests do pass in my environment... let me check again.
If you run pytest --lf to rerun just the failed test, can you send the full backtrace?
The tests do pass in my environment. If you have a full backtrace I can try to see what the issue is.
Tests pass! I reinstalled nequip and allegro, and then everything works. Thank you!
Also, I found a possible bug in allegro/nn/_allegro.py at line 337:
if layer_idx == 0:
    # at the first layer, we have no invariants from previous TPs
    self.latents.append(
        two_body_latent(
            mlp_input_dimension=(
                (
                    # Node invariants for center and neighbor (chemistry)
                    2 * self.irreps_in[self.node_invariant_field].num_irreps
                    # Plus edge invariants for the edge (radius).
                    + self.irreps_in[self.edge_invariant_field].num_irreps
                )
            ),
            mlp_output_dimension=None,
            **two_body_latent_kwargs,
        )
    )
else:
    self.latents.append(
        latent(
            mlp_input_dimension=(
                (
                    # the embedded latent invariants from the previous layer(s)
                    self.latents[-1].out_features
                    # and the invariants extracted from the last layer's TP:
                    + env_embed_multiplicity * n_scalar_outs  ### Should be self._n_scalar_outs[layer_idx - 1]
                )
            ),
            mlp_output_dimension=None,
        )
    )
When a latent layer is initialized, the scalar outputs of the previous layer are used. As a result, I think n_scalar_outs should be replaced by self._n_scalar_outs[layer_idx - 1].
A bug could arise when n_scalar_outs differs between layers, although right now it works because every layer shares the same number of scalar outputs.
Can you please check this possible bug? Thank you.
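For clarity, here is a sketch of the else branch with the suggested index, shown against the snippet above (the reporter's proposed change, not a merged patch or standalone script):

else:
    self.latents.append(
        latent(
            mlp_input_dimension=(
                # the embedded latent invariants from the previous layer(s)
                self.latents[-1].out_features
                # plus the scalar invariants recorded for the previous layer's TP
                + env_embed_multiplicity * self._n_scalar_outs[layer_idx - 1]
            ),
            mlp_output_dimension=None,
        )
    )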
Great! Glad to hear.
Yes, you're right: this is technically speaking a bug, and one that I have a fix for on a branch somewhere... I'll try to get that merged soon. But as you observed, for the Allegro architecture the env_embed_irreps, and thus n_scalar_outs, are the same at every layer (as long as the initial features also have the same irreps, which they do unless you customize the input features), so it doesn't matter.
Good catch 😁 and thanks for the reminder on this!
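For what it's worth, a hypothetical one-line sanity check (not part of _allegro.py) that expresses this invariant against the snippet above:

# Hypothetical check: with the default Allegro setup every layer records the
# same scalar-output count, so indexing by `layer_idx - 1` versus the
# loop-local value yields the same input dimension.
assert all(n == self._n_scalar_outs[0] for n in self._n_scalar_outs)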
Thanks for your reply. Great work!