Flaky tests caused by floating point precision loss
xyzsam opened this issue · 2 comments
https://travis-ci.org/github/harvard-acc/smaug/builds/702716460 failed because one floating point element was off by roughly 0.002, which exceeds the tolerance allowed by the required 3 decimal places of accuracy:
======================================================================
FAIL: test_bahdanau_attention (__main__.AttentionTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "./build/smaug/python/ops/attention_test.py", line 67, in test_bahdanau_attention
self.runAndValidate(graph, tf_attention)
File "/home/travis/build/harvard-acc/smaug/smaug/python/smaug_test.py", line 85, in runAndValidate
assert_array_almost_equal(expected_output, sg_output, decimal=3)
File "/home/travis/.local/lib/python3.6/site-packages/numpy/testing/_private/utils.py", line 1044, in assert_array_almost_equal
precision=decimal)
File "/home/travis/.local/lib/python3.6/site-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Arrays are not almost equal to 3 decimals
Mismatched elements: 1 / 64 (1.56%)
Max absolute difference: 0.001953
Max relative difference: 0.02615
x: array([[-0.566, -0.007, -0.665, -0.778, -0.135, -0.739, -0.725, -0.221,
-0.307, -0.094, -0.493, 0.492, 0.282, -0.381, -0.438, -0.242,
0.327, -0.43 , 0.454, -0.639, 0.295, -0.207, 1.404, -0.821,...
y: array([[-0.566, -0.008, -0.665, -0.779, -0.135, -0.74 , -0.725, -0.222,
-0.307, -0.094, -0.493, 0.491, 0.283, -0.38 , -0.438, -0.241,
0.327, -0.431, 0.453, -0.639, 0.296, -0.207, 1.402, -0.82 ,...
The mismatched element is -0.221 in one array and -0.222 in the other. This test did pass before, though, so it's worth tracking down where the flakiness is coming from. If it's code we own, it should be possible to always reproduce the exact same numbers; if the difference is coming from TF code, then it's going to be harder to debug, in which case we should just loosen the required accuracy.
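For reference, here is a minimal sketch of what the numpy check enforces and how the tolerance could be loosened if needed; the values below are illustrative only, not the actual test data:

```python
import numpy as np
from numpy.testing import assert_array_almost_equal, assert_allclose

# assert_array_almost_equal(x, y, decimal=3) requires
# abs(x - y) < 1.5 * 10**-3 for every element, so a difference of
# 0.001953 (as in the failing build) is enough to fail.
expected = np.array([-0.221000], dtype=np.float32)  # illustrative values
actual = np.array([-0.222953], dtype=np.float32)

# This would fail: |difference| = 0.001953 >= 1.5e-3
# assert_array_almost_equal(expected, actual, decimal=3)

# One way to loosen the check, if the error comes from TF-side numerics:
assert_array_almost_equal(expected, actual, decimal=2)

# Or use an explicit absolute/relative tolerance instead:
assert_allclose(actual, expected, atol=5e-3, rtol=0)
```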
Sure, I will look into this. The accuracy loss may come from slight implementation differences in softmax. And because we use random data in these tests, the failure isn't reproducible on every run, which is why it passed before.
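One way to make the failure reproducible while debugging would be to seed the random generators before building the test inputs. A minimal sketch, assuming the test data comes from numpy and TF 2.x random ops (the seed value and input shape are arbitrary; for TF 1.x the call would be tf.set_random_seed):

```python
import numpy as np
import tensorflow as tf

# Fix both generators so the test inputs (and hence the compared
# outputs) are identical across runs; the seed value is arbitrary.
SEED = 1234
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Any subsequent random test data, e.g. inputs to the attention op,
# is now deterministic across runs.
inputs = np.random.rand(4, 16).astype(np.float32)
```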
Closing because it's fixed by PR #19.