openstf/stf

Device unit subscriber dies after 20 or 30 minutes if device is kept idle

Closed this issue · 7 comments

I have a working STF setup in production. I am currently facing a problem that has been bugging me for a couple of days, and I want to know if anybody else has seen the same thing.

The problem is that if I keep a device idle for 20 or 30 minutes, I get a totally blank screen after clicking the Use button on the device. Screenshots are below.

STF device list view. As can be seen, the devices are shown as usable:
screen shot 2015-10-07 at 13 25 59

This is the view I get when I click the Use button for an (expired) device:
screen shot 2015-10-07 at 13 46 30

I did some troubleshooting and found that the device subscriber stops receiving messages once it has been idle for some minutes. So basically the "group.invite" (GroupMessage) sent from the app side is not being received by the device unit.

I confirmed this by writing a simple ZeroMQ subscriber and connecting it to the dev-side publisher. The script looks like this:

var zmq = require('zmq');

var sub = zmq.socket('sub');

// Subscribe to the device channel
sub.subscribe('QsoRFrXSROZTpgwiqmTxlpDoelg=');

// Each message arrives as two frames: the channel and the payload
sub.on('message', function(channel, data) {
  console.log('got message on channel ' + channel + ' data is : ' + data.toString() + '\n');
});

// Connect to the dev-side publisher
sub.connect('tcp://DEVSIDE_IP:7250');

If I press the "Use" button now, I can see the subscriber in the script receiving messages. Then I kept the (expired) device idle for around 30 minutes and, not surprisingly, the subscriber in the script also stopped receiving messages. So basically, the subscriber dies if kept idle.

I researched ZeroMQ's dying-subscriber problem and found that it can have various causes, such as a TCP timeout or a network problem between machines. I checked the TCP timeout on my provider machine; it is 2 hours, so most probably this is not the problem.

One more thing: I am using OS X for my provider instead of a Linux machine. I know STF does not support OS X in production, but I am currently using the resources I have available. All the other units are running on CentOS 7.0.

I want to know if anybody has faced this dying-subscriber problem, or can suggest why it is dying or not receiving messages. I would also be glad if someone could suggest a workaround. Currently I am thinking of sending heartbeats to all the device subscribers to keep them alive.
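A rough sketch of the heartbeat idea, not STF code. The function name `sendHeartbeats` and the `'heartbeat'` payload are made up for illustration. A real implementation would need a message type that the device unit recognises and ignores.

```javascript
// Hypothetical sketch: publish a small heartbeat message to each device
// channel so the TCP connection never stays silent for long.
function sendHeartbeats(pub, channels) {
  channels.forEach(function (channel) {
    // Multipart message: [channel, payload], the same two-frame shape
    // the subscriber script above receives.
    pub.send([channel, 'heartbeat']);
  });
  return channels.length;
}

// Usage (assumes the `zmq` npm package and a bound publisher socket):
//   var zmq = require('zmq');
//   var pub = zmq.socket('pub');
//   pub.bindSync('tcp://*:7250');
//   setInterval(function () {
//     sendHeartbeats(pub, ['QsoRFrXSROZTpgwiqmTxlpDoelg=']);
//   }, 60 * 1000);
```

The interval only needs to be shorter than whatever idle timeout the network in between enforces. One minute here is an arbitrary choice.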

Thanks,

Haven't seen this problem before, but like you said we only use OS X for development. Please keep us updated if you find a solution.

Currently experiencing the same with both my providers: a Raspbian (Raspberry Pi) provider and an Ubuntu 15 server. If left idle, the devices are still visible as usable, but no data is received if you try to use them, and they are not marked as in use (no "stop using" button visible; it's like I never tried to connect).
So far my workaround has been to periodically restart the stf-provider service.

Well, I did some experiments by creating different publishers and subscribers on different machines. The current results are as follows:

  1. If the publisher and subscriber are on the same host (same machine), the subscriber continues receiving messages (as is the case with stf local).
  2. If the publisher and subscriber are on different hosts, the subscriber stops receiving messages if kept idle.

Still trying to figure out the cause behind the problem. And it's frustrating, since I have to wait 30 minutes to reproduce the behaviour.

Related Issues: http://stackoverflow.com/questions/28252054/zmq-socket-not-working-after-a-period-of-time

I think I figured it out.

As mentioned in the ZeroMQ guide, http://zguide.zeromq.org/page:all#Shrugging-It-Off

If we use a TCP connection that stays silent for a long while, it will, in some networks, just die. Sending something (technically, a "keep-alive" more than a heartbeat), will keep the network alive.

Indeed, the problem was in the TCP connection. By default, ZeroMQ uses the OS settings for the TCP_KEEPALIVE option: http://api.zeromq.org/3-2:zmq-setsockopt

Override SO_KEEPALIVE socket option(where supported by OS). The default value of -1 means to skip any overrides and leave it to OS default.

On OS X and most Linux systems, keepalive is off by default. So turning it on worked for me.

sudo sysctl -w net.inet.tcp.always_keepalive=1

In addition, the default wait time before sending keepalive probes is 2 hours, so it also needs to be lowered. I made it 10 minutes (on OS X this value is in milliseconds).

sudo sysctl -w net.inet.tcp.keepidle=600000

This fixed my problem, but I still feel it is not a good solution. It would be better to handle it in the application itself. ZeroMQ lets us set these options when creating sockets, so a good solution would be to set them on all the subscribers and dealers in stf, something like below:

sub.setsockopt(zmq.ZMQ_TCP_KEEPALIVE, 1)
// The socket option is in seconds (unlike the OS X sysctl), so 600 = 10 minutes
sub.setsockopt(zmq.ZMQ_TCP_KEEPALIVE_IDLE, 600)
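One possible way to wrap the two calls above so they can be applied to every socket, as a minimal sketch. `enableKeepalive` is a hypothetical helper, not an existing STF or zmq function. It assumes the `zmq` module exposes the `ZMQ_TCP_KEEPALIVE` and `ZMQ_TCP_KEEPALIVE_IDLE` constants used above, and that the idle value is given in seconds.

```javascript
// Hypothetical helper: enable TCP keepalive on a ZeroMQ socket.
// `zmq` is the module returned by require('zmq'); `sock` is a socket it created.
function enableKeepalive(zmq, sock, idleSeconds) {
  // 1 forces SO_KEEPALIVE on instead of leaving it to the OS default (-1).
  sock.setsockopt(zmq.ZMQ_TCP_KEEPALIVE, 1);
  // Idle time before the first keepalive probe is sent, in seconds.
  sock.setsockopt(zmq.ZMQ_TCP_KEEPALIVE_IDLE, idleSeconds);
  return sock;
}

// Usage (assumes the `zmq` npm package is installed):
//   var zmq = require('zmq');
//   var sub = enableKeepalive(zmq, zmq.socket('sub'), 600); // 10 minutes
//   sub.connect('tcp://DEVSIDE_IP:7250');
```

Passing the module in as an argument keeps the helper independent of how the socket was created, so the same function could be used for sub, dealer, and other socket types.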

@michaelmadsen
Can you check if this solution works for you? I don't have a Linux provider.

Great, thanks for doing the research! I can say that we have never had this exact same problem, but then again we've only used CoreOS so far, which I suppose may have different defaults. It would be great if others could verify that your solution works for them.

Furthermore, it sounds like it would be good to add the options to all ZMQ sockets, not just the subscriptions. I will look into adding your suggested patches into the application code, but it may take a while since I am somewhat busy right now. I would be happy to accept a pull request, though.

Yeah sure, I will write a PR for this. The reason I did not do it before is that I have not tested this on a Linux provider.

For people who are interested in why TCP connections die on some networks, this will help:

The other useful goal of keepalive is to prevent inactivity from disconnecting the channel. It's a very common issue, when you are behind a NAT proxy or a firewall, to be disconnected without a reason. This behavior is caused by the connection tracking procedures implemented in proxies and firewalls, which keep track of all connections that pass through them. Because of the physical limits of these machines, they can only keep a finite number of connections in their memory. The most common and logical policy is to keep newest connections and to discard old and inactive connections first.

http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html

Hi @vbanthia,
Great work!

I have started using STF and it's very useful for me; I am loving it.
Could you please help me with a couple of questions?
I am using Mac 10.12.4.

1) "If publisher and subscriber are on different host, subscriber discontinue receiving messages if kept idle."
Say I forget to click "Stop Using" on a device before the session times out. Will that device stay busy for others until I click "Stop Using" or the session times out?
So, as you said, is lowering the (TCP) timeout the only way to reduce or set the session timeout for a device that is idle or unused for some time?

2) How do I rename my devices in STF?

At the moment we cannot get the device name; STF shows the model name instead. Is there any ADB command to get the device name?

screen shot 2017-04-06 at 15 09 31

Thanks
SK