microsoft/DeepSpeed

[BUG] inference ops unit tests are failing

oelayan7 opened this issue · 4 comments

The tests under unit/ops/transformer/inference are not being run in any CI job.
Some tests in that directory are failing (examples below). I talked to @loadams about it, and he tried running them on a V100 setup.
His results for those tests were 440 failed, 2598 passed, 8 skipped.

Examples of the failing tests:

  • unit/ops/transformer/inference/test_bias_geglu.py::test_gated_silu: the results differ from the reference.
  • unit/ops/transformer/inference/test_layer_norm.py::test_layer_norm: the failure was Feature '.bf16' requires .target sm_80 or higher

A hint that could help: those tests are parametrized over the supported dtypes, and the failures are always in the dtype2 case (I assume it is bf16).

test_layer_norm_residual, test_residual_add, test_bias_geglu, test_moe_residual_matmul, test_pre_norm, test_rms_norm
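
The sm_80 message above points at a hardware gate: V100 is compute capability 7.0, so bf16 kernels cannot be compiled for it. A minimal sketch of gating the parametrized dtype list on compute capability could look like the following. The helper names (`supports_bf16`, `dtypes_for_device`) and the string dtype labels are hypothetical and are not DeepSpeed's actual test utilities; the fp16, fp32, bf16 ordering is an assumption chosen so that bf16 lands at index 2, matching the dtype2 id that pytest auto-generates for non-primitive parameters.

```python
# Hypothetical helpers for illustration only; not part of DeepSpeed.

# sm_80 is the minimum target named in the error message above.
BF16_MIN_COMPUTE_CAPABILITY = (8, 0)


def supports_bf16(compute_capability):
    """Return True if a (major, minor) compute capability is sm_80 or newer."""
    return tuple(compute_capability) >= BF16_MIN_COMPUTE_CAPABILITY


def dtypes_for_device(compute_capability):
    """Build the dtype list to parametrize over, adding bf16 only when supported.

    Strings stand in for torch dtypes so the sketch runs without a GPU.
    The ordering is an assumption, not taken from the DeepSpeed sources.
    """
    dtypes = ["fp16", "fp32"]
    if supports_bf16(compute_capability):
        dtypes.append("bf16")
    return dtypes


if __name__ == "__main__":
    print(dtypes_for_device((7, 0)))  # V100 (sm_70)
    print(dtypes_for_device((8, 0)))  # A100 (sm_80)
```

In a real test module the tuple would presumably come from `torch.cuda.get_device_capability()`, which returns `(major, minor)`. Whether the existing tests already apply a check like this is not something this sketch assumes.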

Thanks @oelayan7. Updating this: when we currently run off of the master branch, I see the following:

 68 failed, 1064 passed, 2545 skipped, 2103 deselected, 4 warnings in 226.45s (0:03:46) 

Will create and link a PR that reproduces this.