heroku/vegur

vegur_roundtrip_SUITE:large_chunked_request_response_interrupt has non-deterministic failures

evanmcc opened this issue · 5 comments

=== Ended at 2014-12-18 10:08:48
=== location [{vegur_roundtrip_SUITE,recv_until_close,2017},
              {vegur_roundtrip_SUITE,large_chunked_request_response_interrupt,1907},
              {test_server,ts_tc,1415},
              {test_server,run_test_case_eval1,1028},
              {test_server,run_test_case_eval,976}]
=== reason = timeout
  in function  vegur_roundtrip_SUITE:recv_until_close/1 (vegur_roundtrip_SUITE.erl, line 2017)
  in call from vegur_roundtrip_SUITE:large_chunked_request_response_interrupt/1 (vegur_roundtrip_SUITE.erl, line 1907)
  in call from test_server:ts_tc/3 (test_server.erl, line 1415)
  in call from test_server:run_test_case_eval1/6 (test_server.erl, line 1028)
  in call from test_server:run_test_case_eval/9 (test_server.erl, line 976)

Ignore the bogus line numbers and error reason; I have some debugging code in the test. gen_tcp:recv/3 eventually fails with Timeout = 100, 300, and 10000 ms (I didn't try anything higher). I don't have good counts on how often this happens, but in all cases it failed within five minutes. Just run:

 while [ $? -eq 0 ]; do ct_run -dir test/ -logdir logs -pa ebin -pa deps/*/ebin; done

and you'll get a failure before too long.
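
The helper that times out is essentially a plain recv loop; roughly like this (simplified sketch, not the exact suite code):

    %% Sketch only: read from the socket until the peer closes it. The
    %% timeout in the report above comes from this gen_tcp:recv/3 call
    %% giving up before the proxy ever closes the connection.
    recv_until_close(Client) ->
        recv_until_close(Client, []).

    recv_until_close(Client, Acc) ->
        case gen_tcp:recv(Client, 0, 10000) of
            {ok, Data} -> recv_until_close(Client, [Data | Acc]);
            {error, closed} -> lists:flatten(lists:reverse(Acc))
        end.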

Edited output of erlang:port_info(Port) on the port as it times out:

[{name,"tcp_inet"},
 {links,[<0.4081.2>]},
 {id,10729},
 {connected,<0.4081.2>},
 {input,0},
 {output,12000}, <------
 {os_pid,undefined}]
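
Since a gen_tcp socket is itself a port, that snapshot can be grabbed inline around the failing recv; something like this (a sketch — Client and Timeout are the names from the connect call, the exit reason is illustrative):

    %% Sketch: when the recv times out, dump the socket's port statistics.
    %% {input,N}/{output,N} are the total bytes read from / written to the
    %% port, so input 0 with output 12000 means 12000 bytes were handed to
    %% the socket but nothing was ever read back.
    case gen_tcp:recv(Client, 0, Timeout) of
        {error, timeout} ->
            ct:pal("port_info: ~p", [erlang:port_info(Client)]),
            exit(recv_timeout);
        Result ->
            Result
    end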

Note that I also tried:

-    {ok, Client} = gen_tcp:connect(IP, Port, [{active,false},list],1000),
+    {ok, Client} = gen_tcp:connect(IP, Port, [{active,false},list,{sndbuf,100000},{recbuf,100000}],1000),

but got the same output when it failed.

I also saw an identical failure in vegur_roundtrip_SUITE:large_close_request_response_interrupt/1.

ferd commented

I'm wondering if this isn't just bad TCP stacks falling into weird states here and there. Running the tests on localhost and on Travis would sometimes yield entirely different ways of terminating connections.

evanmcc commented

Could well be. I'd feel more comfortable if this failed less often, though, so mostly I'm looking for ameliorations that bring the failure rate under 1% of runs, ideally much less.

ferd commented

I think this has been fixed while reworking the interruption detection and semantics. Marking as closed, will reopen if we see it happen again.