sipcapture/heplify-server

Recurrence of Issue with Reversed INVITE / 100 Trying

joshelson opened this issue ยท 23 comments

Using Homer7 with Heplify via Docker, sending from our SBC infrastructure via HEP, and seeing a recurrence of an old issue where seem to be lacking enough timer precision to separate the time of arrival for INVITE and 100 trying messages. See below:

image

This only occurs 25% of the time or so, but when it does it's very confusing to our poor support teams when they look at the call ladders.

Do we have a way to add timer precision? Or simply not allow a 100/Trying to be the first element in a call ladder? Or should we simply move our Homer server further away from our SBCs? :)

Any help appreciated!

if you click on each message, can you please compare the timestamps seconds+useconds ? I see that seconds and milliseconds are same, but more interesting to understand if the microseconds are different. Also please switch heplify from pcap to afpacket mode - this will take the timestamps from kernel.

It appears that they are also reversed in the PCAP:

100/Trying Arrival Time: Apr 28, 2021 11:33:51.195982000 MDT

Invite Arrival Time: Apr 28, 2021 11:33:51.196122000 MDT

This is a bit of an aside, but is it possible to easily set af_packet from the environment variables for a docker installation?

@lmangani do we have a param for docker ?

It appears that they are also reversed in the PCAP:

100/Trying Arrival Time: Apr 28, 2021 11:33:51.195982000 MDT

Invite Arrival Time: Apr 28, 2021 11:33:51.196122000 MDT

This is a bit of an aside, but is it possible to easily set af_packet from the environment variables for a docker installation?

but these are different messages...

So might have grabbed a bad example. Here's one we just created.

This is the 100/trying:

image

This is the initial invite to the SBC:

image

These show reversed even on the usecond portion where the INVITE shows received after the 100.

Looking at this a little more, we're only using heplify-server in this environment and not the heplify capture client. Should we expect af_packet to work and/or have any effect with heplify-server or only with the heplify agent itself?

We are using a commercial SBC from Sansay which has support for HEP built-in. It's possible they're sending slightly out of order, but this might be tricky for us to verify for a bit.

@joshelson ok, this is something different, you said you use heplify (the client). In case of Sansay, please capture HEP packets (port 9063) and install the wireshark hep lua plugin and check the timestamp inside. Probably they set timestamp on send and not on receive. https://github.com/sipcapture/hep-wireshark

So I just captured the HEP incoming to the heplify-server and there does appear to be a problem in heplify-server causing the issues here.

The SBC is properly writing the order of the usecond frames in the HEP packets, but the PCAP retrieved from the other side shows the initial two packets reversed.

Here's the HEP capture prior to processing by heplify-server: https://drive.google.com/file/d/12C5txdqHUwQSMSkQTd9JAHe-gGdUdL08/view?usp=sharing

And here's the output of the PCAP post-processing from Homer: https://drive.google.com/file/d/1lxxtT8clgeSAjdbrPs0eJOaT-TfzdpXG/view?usp=sharing

ok, found. Weird stuff is here:

bb.WriteString(strconv.FormatUint(uint64(h.Timestamp.Unix()), 10))
bb.WriteString(`,"timeUseconds":`)
bb.WriteString(strconv.FormatUint(uint64(h.Timestamp.Nanosecond()/1000), 10))
bb.WriteString(`,"payloadType":`)

@negbie any reason why ?

@negbie and you understand this is not good :-) Let me recheck why it forces to local time.

Read commit description text.

I highly recommend reading this https://golang.org/pkg/time/

I think the problem could be due to line 119 in the decoder. I would turn on debug level and see if that condition is problematic. I added that more as a workaround and not as final solution at that time.

i think the problem is here: https://github.com/sipcapture/heplify-server/blob/master/decoder/decoder.go#L118-L119

if time has been not synced you force to local time and it makes this effect. Need disable d.Sub and do only null check

See comment above.

Good now you understood.

I understood that we couldn't force to local time and break all call flows if times between two nodes are desync for milliseconds. :-) I am making a PR! Thanks for collaboration!

#469 @negbie @lmangani Please review!

@joshelson we are building new packages and docker containers with version 1.55 Can you please update your system and check if the issue has been fixed ? Thank you for the detailed bug report! I wish everybody would provide such good information!

This appears to resolve the issue! Thank you guys so much for the responiveness! It's working great now.