mir-group/allegro

"atom_types" not included in pytest and pytest not pass

Huni-ML opened this issue · 7 comments

Describe the bug

There is a bug in tests/conftest.py: no TypeMapper is set for the atomic_batch fixture, so atom_types is missing from the data.
Pytest also throws several failures, listed below; the numeric_gradient, partial_forces, and normalized_basis tests do not pass.
In addition, the TorchScript interpreter fails for float64.
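
For reference, here is a minimal sketch (hypothetical molecule, cutoff, and type mapping; not the actual conftest.py) of how a TypeMapper can be applied so that "atom_types" ends up in the test data, assuming the nequip 0.5.x API:

# Hypothetical sketch, not repository code: apply a TypeMapper so that
# AtomicDataDict.ATOM_TYPE_KEY ("atom_types") is present in the data.
import ase.build

from nequip.data import AtomicData, AtomicDataDict
from nequip.data.transforms import TypeMapper

atoms = ase.build.molecule("CH3CHO")          # arbitrary small test molecule
data = AtomicData.from_ase(atoms, r_max=4.0)  # build the neighborlist with a 4.0 A cutoff
tm = TypeMapper(chemical_symbol_to_type={"H": 0, "C": 1, "O": 2})
data = tm(data)                               # maps atomic numbers to atom types
assert data[AtomicDataDict.ATOM_TYPE_KEY] is not None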

Reproduce

Just run pytest:

FAILED tests/model/test_allegro.py::TestGradient::test_numeric_gradient[float32-config0-device0] - AssertionError: assert (tensor(False, device='cuda:0') or tensor(False, device='cuda:0'))
FAILED tests/model/test_allegro.py::TestGradient::test_numeric_gradient[float32-config1-device0] - AssertionError: assert (tensor(False, device='cuda:0') or tensor(False, device='cuda:0'))
FAILED tests/model/test_allegro.py::TestGradient::test_numeric_gradient[float32-config2-device0] - AssertionError: assert (tensor(False, device='cuda:0') or tensor(False, device='cuda:0'))
FAILED tests/model/test_allegro.py::TestGradient::test_numeric_gradient[float32-config3-device0] - AssertionError: assert (tensor(False, device='cuda:0') or tensor(False, device='cuda:0'))
FAILED tests/model/test_allegro.py::TestGradient::test_partial_forces[float32-device0] - AssertionError: assert False
FAILED tests/nn/test_norm_basis.py::test_normalized_basis[float32-0.2] - assert tensor(False)
FAILED tests/nn/test_norm_basis.py::test_normalized_basis[float32-1.0] - assert tensor(False)
FAILED tests/model/test_allegro.py::TestWorkflow::test_weight_init[float64-device0-config0-model0] - RuntimeError: The following operation failed in the TorchScript interpreter.
FAILED tests/model/test_allegro.py::TestWorkflow::test_weight_init[float64-device0-config0-model1] - RuntimeError: The following operation failed in the TorchScript interpreter.
FAILED tests/model/test_allegro.py::TestWorkflow::test_jit[float64-device0-config0-model0] - RuntimeError: The following operation failed in the TorchScript interpreter.
FAILED tests/model/test_allegro.py::TestWorkflow::test_jit[float64-device0-config0-model1] - RuntimeError: The following operation failed in the TorchScript interpreter.
....

Environment (please complete the following information):

OS: CentOS 7
python version: 3.9.12
python environment:
nequip version: 0.5.5
e3nn version: 0.5.0
pytorch version: 1.11.0+cu113

Hi @yuhongyu0721,

Thanks very much for pointing out this issue, and for your interest in our code!

I believe I've resolved this on the develop branch, and the test suite should now run correctly. Please note that it also requires the latest develop branch of the nequip repository. Can you pull down the latest commits on both develop branches and let me know if the tests succeed for you now? Thanks!

Thanks a lot for your reply, and for your great contributions to this field!
I have updated nequip and allegro to the latest develop branches. However, several errors are still thrown.

ERROR tests/model/test_allegro.py::TestAllegro::test_numeric_gradient[float64-device0-config0-base_config2]
E           RuntimeError: Failed to build object with prefix `edge_eng` using builder `ScalarMLP`

The initialization of edge_eng (ScalarMLP) fails even though edge_eng_mlp_latent_dimensions is present in the config. This is confusing to me, since edge_eng should pick up edge_eng_mlp_latent_dimensions as its mlp_latent_dimensions argument.
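
To illustrate what I expect (hypothetical values, written as a Python dict rather than my actual YAML config), keys with the edge_eng_ prefix should be forwarded to the ScalarMLP builder with the prefix stripped:

# Hypothetical config fragment, for illustration only:
config = {
    "edge_eng_mlp_latent_dimensions": [32],  # expected to become ScalarMLP(mlp_latent_dimensions=[32], ...)
    "edge_eng_mlp_nonlinearity": None,       # expected to become ScalarMLP(mlp_nonlinearity=None, ...)
}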

Do the tests pass in your environment? Thanks!

Hm, the tests do pass in my environment... let me check again.

If you run pytest --lf to just rerun the one failed test, can you send the full backtrace?

The tests do pass in my environment. If you have a full backtrace I can try to see what the issue is.

Tests pass! I reinstalled nequip and allegro and then everything worked. Thank you!

Also, I found a possible bug in allegro/nn/_allegro.py at line 337:

            if layer_idx == 0:
                # at the first layer, we have no invariants from previous TPs
                self.latents.append(
                    two_body_latent(
                        mlp_input_dimension=(
                            (
                                # Node invariants for center and neighbor (chemistry)
                                2 * self.irreps_in[self.node_invariant_field].num_irreps
                                # Plus edge invariants for the edge (radius).
                                + self.irreps_in[self.edge_invariant_field].num_irreps
                            )
                        ),
                        mlp_output_dimension=None,
                        **two_body_latent_kwargs,
                    )
                )
            else:
                self.latents.append(
                    latent(
                        mlp_input_dimension=(
                            (
                                # the embedded latent invariants from the previous layer(s)
                                self.latents[-1].out_features
                                # and the invariants extracted from the last layer's TP:
                                + env_embed_multiplicity * n_scalar_outs  ### Should be self._n_scalar_outs[layer_idx - 1]
                            )
                        ),
                        mlp_output_dimension=None,
                    )
                )

When a latent layer is initialized, the number of scalar outputs of the previous layer is what should be used, so I think n_scalar_outs should be replaced by self._n_scalar_outs[layer_idx - 1] here.
This could become a bug if n_scalar_outs differed between layers, though right now it works because every layer has the same n_scalar_outs.
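
In other words, the suggestion is something like this (a sketch of the change, not tested):

            else:
                self.latents.append(
                    latent(
                        mlp_input_dimension=(
                            # the embedded latent invariants from the previous layer(s)
                            self.latents[-1].out_features
                            # plus the invariants extracted from the previous layer's TP:
                            + env_embed_multiplicity * self._n_scalar_outs[layer_idx - 1]
                        ),
                        mlp_output_dimension=None,
                    )
                )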

Can you please check this possible bug? Thank you.

Great! Glad to hear.

Yes, you're right: this is technically speaking a bug, and one that I have a fix for on a branch somewhere... I'll try to get that merged soon. But as you observed, for the Allegro architecture the env_embed_irreps, and thus n_scalar_outs, are the same at every layer (as long as the initial features also have the same irreps, which they do unless you customize the input features), so in practice it doesn't matter.
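
To illustrate the point (a toy check with e3nn, not code from the repository): if env_embed_irreps is the same at every layer, then the number of scalar (0e) irreps, and hence n_scalar_outs, is the same at every layer as well.

# Toy illustration with e3nn, not repository code: count the scalar (0e) irreps
# in a fixed, hypothetical env_embed_irreps; the count is layer-independent.
from e3nn import o3

env_embed_irreps = o3.Irreps("1x0e + 1x1o + 1x2e")
n_scalar = sum(mul for mul, ir in env_embed_irreps if ir.l == 0 and ir.p == 1)
print(n_scalar)  # -> 1, the same for every layer that uses these irreps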

Good catch 😁 and thanks for the reminder on this!

Thanks for your reply. Great work!