remp2020/remp

BEAM client too aggressive

j-norwood-young opened this issue · 6 comments

The BEAM client fires every 5 seconds, using about 1.3kb per request. If a page is open for an hour, that would use up about 8mb of data. If someone leaves a page open for 24 hours in South Africa, it would cost a day's wages.

Possible solutions:

  • Use websockets where possible. You can detect the termination of the websocket to fire the time spent update.
  • Reduce the amount of data sent per request. After the first request, you really just need a pageview ID. You could even remove the time_spent - the server could calculate the time spent. (This would also close the vulnerability of being able to manually manipulate time spent.)
  • Make the interval a configurable variable.

Those requests are weirdly big. Considering the request that's currently being generated:

{"action":"timespent","timespent":{"seconds":30,"unload":false},"system":{"property_token":"1a8feb16-3e30-4f9b-bf74-20037ea8505a","time":"2019-07-23T11:55:38.399Z"},"user":{"id":"92363","browser_id":"a1a0cb38-4c7a-49f5-b1d6-3246c5f4ae73","subscriber":true,"url":"https://dennikn.sk/","referer":"","adblock":false,"window_height":1050,"window_width":1920,"cookies":true,"websockets":true,"source":{},"remp_session_id":"72c73e09-ed47-4259-b35f-4599d400ba41","remp_pageview_id":"dfa61927-d5c7-4084-bbf2-7a367fe30cf0"}}

It's 516 bytes uncompressed and developer tools report this as 171B sent over network (probably gzip compression). Would you share your requests so we can check why they're so big?
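For anyone who wants to cross-check such numbers outside the browser, here is a minimal Node.js sketch that measures a payload's raw and gzipped body size. The `payloadSizes` helper and the shortened `sample` object are made up for illustration, and the result covers the body only: headers and any lower-level overhead are not included.

```javascript
// Sketch: measure a tracker-style JSON body before and after gzip (Node.js).
// Only the body is measured; HTTP headers are not counted.
const zlib = require("zlib");

function payloadSizes(payload) {
  const json = JSON.stringify(payload);
  return {
    raw: Buffer.byteLength(json, "utf8"),   // bytes of the uncompressed body
    gzipped: zlib.gzipSync(json).length,    // bytes after gzip compression
  };
}

// Hypothetical, shortened payload using field names from the request above.
const sample = {
  action: "timespent",
  timespent: { seconds: 30, unload: false },
  user: { url: "https://dennikn.sk/", adblock: false },
};

console.log(payloadSizes(sample));
```

Running the same helper against a full request body copied from developer tools should make it easier to compare numbers between sites.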

As for why it works this way:

  • The reason this is sent periodically is to keep track of whether the user is still reading the article. The "concurrents" metric relies on this and filters users with activity within the last N minutes (unless that is disabled and "pageviews" are used as the source: https://github.com/remp2020/remp/blob/master/Docker/telegraf/telegraf.conf#L80).
  • The reason we send more fields than just the pageview ID is to have the most commonly used fields tracked in the same index as timespent. It makes querying the data much easier, as you don't need to combine them client-side.
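To illustrate the "activity within the last N minutes" idea, here is a rough sketch of such a filter. It is not how REMP computes concurrents (that happens in the storage/aggregation layer); the `countConcurrents` function and the in-memory `events` array are assumptions made for the example, and only the `system.time` and `user.browser_id` field names come from the payload above.

```javascript
// Sketch: count distinct browsers with at least one event in the last N minutes.
// Hypothetical helper; events are assumed to look like the tracker payloads.
function countConcurrents(events, now, windowMinutes) {
  const cutoff = now.getTime() - windowMinutes * 60 * 1000;
  const active = new Set();
  for (const event of events) {
    // system.time is an ISO 8601 timestamp, as in the request shown above
    if (Date.parse(event.system.time) >= cutoff) {
      active.add(event.user.browser_id);
    }
  }
  return active.size;
}

const events = [
  { system: { time: "2019-07-23T11:55:38.399Z" }, user: { browser_id: "a" } },
  { system: { time: "2019-07-23T11:56:08.399Z" }, user: { browser_id: "a" } },
  { system: { time: "2019-07-23T11:40:00.000Z" }, user: { browser_id: "b" } },
];

// With a 5 minute window, only browser "a" is still considered active.
console.log(countConcurrents(events, new Date("2019-07-23T11:57:00.000Z"), 5));
```

Without the periodic updates, a reader who stays on one article for a long time would drop out of this window even though they are still reading.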

Designing things this way was a simplicity tradeoff: we didn't want Tracker to contain logic or to maintain information about the data/pageviews being tracked. Both would be necessary if we wanted to use websockets or calculate time spent server-side. Tracker is supposed to be a dummy validator which just checks whether the data looks OK and passes it to Kafka. Any restart or load balancing would also cause issues in those scenarios.

Given all of the above, the only feasible solution here is to make the interval configurable.

By the way, internally it uses a logarithmic function that lengthens the interval the longer the page stays open. After an hour, an update is sent only once every 90 seconds. https://github.com/remp2020/remp/blob/master/Beam/resources/assets/js/remplib.js#L736

The implemented configuration will therefore change the initial interval, and the log function will remain in place to keep the interval rising over time.
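To give a feel for how a configurable initial interval and a logarithmic back-off can combine, here is a rough sketch. It is not the formula used in remplib.js (see the linked source for that); `nextInterval`, `countUpdates`, and the growth factor 11 are assumptions, with the factor picked only so the interval lands near the 90 seconds after an hour mentioned above.

```javascript
// Hypothetical back-off, NOT the actual remplib.js implementation.
// The factor 11 is chosen only so the interval is about 90 s after one hour.
function nextInterval(elapsedSeconds, initialInterval = 5) {
  const grown = Math.round(11 * Math.log(elapsedSeconds + 1));
  // Never go below the configured initial interval.
  return Math.max(initialInterval, grown);
}

// Count how many updates a page open for `totalSeconds` would send.
function countUpdates(totalSeconds, initialInterval = 5) {
  let elapsed = 0;
  let updates = 0;
  while (elapsed + nextInterval(elapsed, initialInterval) <= totalSeconds) {
    elapsed += nextInterval(elapsed, initialInterval);
    updates += 1;
  }
  return updates;
}

console.log(nextInterval(0));      // interval right after page load
console.log(nextInterval(3600));   // interval after one hour
console.log(countUpdates(3600));   // compare with 3600 / 5 = 720 at a fixed 5 s interval
```

Under these assumptions the configured value only acts as a floor: it sets how frequent the first updates are, and the logarithmic term takes over as the page stays open.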

Here's an example request payload:

{"article":{"id":"368362","author_id":"Marianne Merten","tags":[],"variants":{}},"action":"load","system":{"property_token":"5478a41d-bac1-4679-8a53-4201bb4294f8","time":"2019-07-23T12:15:08.298Z"},"user":{"id":"18510","browser_id":"c8ff2969-ce3f-447c-a579-9bb5e08cd720","url":"https://www.dailymaverick.co.za/article/2019-07-23-the-never-ending-story-of-eskom-bailouts-mboweni-introduces-special-bill-of-billions-more/","referer":"https://www.dailymaverick.co.za/","user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:69.0) Gecko/20100101 Firefox/69.0","adblock":false,"window_height":777,"window_width":1280,"cookies":true,"websockets":true,"source":{},"remp_session_id":"3e5db5a9-6bf6-4be2-8fa9-999be6067d8c","remp_pageview_id":"7bdf244c-012f-4ecf-ab38-63380064eef0"}}

That's 782 chars, according to wc.

Response header is 239B, so 1021B per request and response.

Firefox consistently reports around 1.3kb of what it calls "Traffic", which leaves about 400B mysteriously unaccounted for.

Chrome reports around 340B per request.

CURL says: "upload completely sent off: 781 out of 781 bytes" (weirdly a byte short from character count - but could be a changed second digit or something.)

Of course the ISP doesn't care about payload size - it just cares about total traffic, which includes DNS lookups, frame headers, checksum etc.

The technicalities don't really matter. The issue really is that some markets are much more sensitive to bandwidth usage than others, due to income/bandwidth cost inequalities. (In SA, a full day's wages for a domestic worker will not even buy you 200MB out-of-bundle data.)

A configurable interval will help alleviate this issue. I'd be happy if I could start at 10s interval instead of 5, giving up on granularity in favour of less impact on our users. In Europe, it's much less of an issue.

Glad to hear about the logarithmic function!

My bad here, I was counting only payload and completely forgot the headers :). Anyway, I understand the point about the traffic limitations, we'll make the configuration happen.

One more reason why timespent needs to be sent by the frontend (for anyone reading this in the future): the timespent timer is paused once the user switches to a different tab and resumed when they come back. This behavior is only observable if the frontend JS library handles it; a server-side calculation wouldn't be able to account for it.
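A minimal sketch of that pause/resume behavior, assuming the Page Visibility API is used to detect tab switches. The `ActiveTimer` class is hypothetical and is not taken from remplib.js; the clock is injectable only so the logic can be exercised outside a browser.

```javascript
// Sketch: accumulate time only while the page is visible.
// ActiveTimer is a hypothetical name, not part of remplib.js.
class ActiveTimer {
  constructor(now = () => Date.now()) {
    this.now = now;          // injectable clock, returns milliseconds
    this.totalMs = 0;        // time accumulated over finished active periods
    this.startedAt = null;   // null while paused
  }

  resume() {
    if (this.startedAt === null) this.startedAt = this.now();
  }

  pause() {
    if (this.startedAt !== null) {
      this.totalMs += this.now() - this.startedAt;
      this.startedAt = null;
    }
  }

  seconds() {
    const running = this.startedAt === null ? 0 : this.now() - this.startedAt;
    return Math.floor((this.totalMs + running) / 1000);
  }
}

// Browser wiring via the Page Visibility API; skipped outside a browser.
if (typeof document !== "undefined") {
  const timer = new ActiveTimer();
  if (!document.hidden) timer.resume();
  document.addEventListener("visibilitychange", () => {
    if (document.hidden) timer.pause();
    else timer.resume();
  });
}
```

The server only sees when requests arrive, so it has no way to tell a hidden tab from an actively read one unless the client reports it.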

Hey. It's in master and will be in a tagged version soon. The JS snippet was changed from:

timeSpentEnabled: true // defaults to false

to:

timeSpent: {
    enabled: true, // defaults to false
    interval: 20 // defaults to 5
}