harvard-acc/smaug

Flaky tests caused by floating point precision loss

xyzsam opened this issue · 2 comments

https://travis-ci.org/github/harvard-acc/smaug/builds/702716460 failed because one floating point element was off by roughly 0.002, which exceeds the tolerance implied by the required 3 decimal places of accuracy:

======================================================================
FAIL: test_bahdanau_attention (__main__.AttentionTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./build/smaug/python/ops/attention_test.py", line 67, in test_bahdanau_attention
    self.runAndValidate(graph, tf_attention)
  File "/home/travis/build/harvard-acc/smaug/smaug/python/smaug_test.py", line 85, in runAndValidate
    assert_array_almost_equal(expected_output, sg_output, decimal=3)
  File "/home/travis/.local/lib/python3.6/site-packages/numpy/testing/_private/utils.py", line 1044, in assert_array_almost_equal
    precision=decimal)
  File "/home/travis/.local/lib/python3.6/site-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not almost equal to 3 decimals
Mismatched elements: 1 / 64 (1.56%)
Max absolute difference: 0.001953
Max relative difference: 0.02615
 x: array([[-0.566, -0.007, -0.665, -0.778, -0.135, -0.739, -0.725, -0.221,
        -0.307, -0.094, -0.493,  0.492,  0.282, -0.381, -0.438, -0.242,
         0.327, -0.43 ,  0.454, -0.639,  0.295, -0.207,  1.404, -0.821,...
 y: array([[-0.566, -0.008, -0.665, -0.779, -0.135, -0.74 , -0.725, -0.222,
        -0.307, -0.094, -0.493,  0.491,  0.283, -0.38 , -0.438, -0.241,
         0.327, -0.431,  0.453, -0.639,  0.296, -0.207,  1.402, -0.82 ,...

The mismatched element is -0.221 vs. -0.222. This test did pass before, though, so it's worth tracking down where the flakiness is coming from. If it's code we own, it should be possible to always reproduce exactly the same numbers; if the difference is coming from TF code, it will be harder to debug, in which case we should just loosen the required accuracy.
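
For reference, numpy's `assert_array_almost_equal(x, y, decimal=d)` checks `abs(x - y) < 1.5 * 10**(-d)` elementwise, so the reported max absolute difference of 0.001953 just misses the `decimal=3` tolerance of 0.0015. A minimal sketch of what loosening the check would tolerate (the array values below are illustrative, not the real test data):

```python
import numpy as np
from numpy.testing import assert_array_almost_equal

# assert_array_almost_equal(x, y, decimal=d) requires
# abs(x - y) < 1.5 * 10**(-d) for every element.
x = np.array([-0.221])
y = np.array([-0.222953])  # off by 0.001953, as in the failing build

# Fails: 0.001953 is not < 1.5e-3.
try:
    assert_array_almost_equal(x, y, decimal=3)
except AssertionError as e:
    print("decimal=3 fails:", e)

# Passes: 0.001953 < 1.5e-2, so loosening to decimal=2 would absorb
# this level of floating point drift.
assert_array_almost_equal(x, y, decimal=2)
```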

Sure, I will look into this. The accuracy loss might come from slight implementation differences in softmax. Also, because we use random data in these tests, the failure isn't reproducible on every run, which is why it passed before.
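
One way to make a failure like this reproducible would be to seed the random data generator in the test setup. A minimal sketch, assuming the tests build their inputs with numpy's RNG; the seed value and the test method are illustrative, not the actual `attention_test.py` code:

```python
import unittest
import numpy as np

class AttentionTest(unittest.TestCase):
    def setUp(self):
        # Fix the seed so every run generates the same "random" inputs,
        # making any precision failure reproducible and bisectable.
        np.random.seed(1234)

    def test_random_inputs_are_reproducible(self):
        # Illustrative check: with a fixed seed, regenerating the inputs
        # yields bit-identical data across runs.
        a = np.random.rand(8, 8).astype(np.float32)
        np.random.seed(1234)
        b = np.random.rand(8, 8).astype(np.float32)
        np.testing.assert_array_equal(a, b)

if __name__ == "__main__":
    unittest.main()
```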

Closing because this is fixed by PR #19.